Ben Fisher

Apr 28, 2015

Aggregating ICEWS Data in R

This is going to be a walkthrough of an R function I've written, which can be found here, that aggregates the recently (finally!) released ICEWS data. aggregate_icews converts the data into country-month observations, with columns counting events internal to a country by type and actor combination. For example, the gov_reb_matcf variable is a count of material conflict events where the government was the source and a rebel group was the target. The event types are based on the standard quad categories: verbal cooperation, material cooperation, verbal conflict, and material conflict. The actor types are based on groupings of agent types from the text_to_CAMEO program. The groupings are:

  • GOV - government actors, grouped from GOV, MIL, JUD, PTY
  • REB - rebel actors, grouped from REB, INS, IMG
  • OPP - political opposition, from just OPP
  • SOC - civil society, grouped from EDU, MED, HLH, CVL, BUS
  • IOS - international organizations, grouped from IGO, NGO
  • USA - United States

Now, this is just how I've decided to group them. There are obviously a lot of ways this could be done, and adding or changing a grouping only takes adding or deleting a line or two of code in the function (there's a sketch of this after the groupings section below).

Since the format the ICEWS data comes in is not particularly easy to use, you'll first need to run Phil Schrodt's text_to_CAMEO program. If you don't have any experience with Python, you can run the program from R using the rPython package. Just make sure that the text_to_CAMEO and ICEWS files are all in your working directory.

library(rPython)
setwd('directory/with/files/')
python.load('text_to_CAMEO.py') # runs the python program, it may take a minute or two
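
Alternatively, if you'd rather not install rPython, calling Python through base R's system() should work just as well, assuming Python is on your path:

system('python text_to_CAMEO.py') # blocks until the script finishes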

This will generate a tab-delimited text file for each year of ICEWS data. The aggregation function takes a data frame as its main argument, so you'll want to read the text files into a data frame.

# read each yearly file and stack them into a single data frame
files = list.files(pattern='reduced.ICEWS.events.*.txt')
icews.data = do.call('rbind', lapply(files, function(x) read.table(x, header=FALSE,
    sep='\t')))

Now that we have the data ready to aggregate, I'm going to walk through the aggregate_icews function so that you can get an idea of how it works. I expect most people will end up modifying the function in some way for different aggregations.

In the first section, I create a vector of actor groupings, var_actors, a list of the quad categories and their corresponding numerical codes, var_types, and a vector of labels for the variables I'm going to create, variables. It's important to note that the function will only create a variable if it is listed in the variables vector. So, if you wanted counts only for dyads involving a government actor, you would remove any label in variables that does not include 'gov' (there's a one-liner for this after the code below).

aggregate_icews = function(df){
    require(lubridate)
    require(dplyr)
    require(reshape2)

    var_actors = c('gov','opp','soc','ios','usa','reb')
    var_types = list(vercp = 1, matcp = 2, vercf = 3, matcf = 4)
    variables = c('gov_gov_vercp', 'gov_gov_matcp', 'gov_gov_vercf', 'gov_gov_matcf',
             'gov_gov_gold', 'gov_opp_vercp', 'gov_opp_matcp', 'gov_opp_vercf',
             'gov_opp_matcf', 'opp_gov_vercp', 'opp_gov_matcp', 'opp_gov_vercf',
             'opp_gov_matcf', 'opp_gov_gold', 'gov_reb_vercp', 'gov_reb_matcp',
             'gov_reb_vercf', 'gov_reb_matcf', 'gov_reb_gold', 'reb_gov_vercp',
             'reb_gov_matcp', 'reb_gov_vercf', 'reb_gov_matcf', 'reb_gov_gold',
             'gov_soc_vercp', 'gov_soc_matcp', 'gov_soc_vercf', 'gov_soc_matcf',
             'gov_soc_gold', 'soc_gov_vercp', 'soc_gov_matcp', 'soc_gov_vercf',
             'soc_gov_matcf', 'soc_gov_gold', 'gov_ios_vercp', 'gov_ios_matcp', 
             'gov_ios_vercf', 'gov_ios_matcf', 'gov_ios_gold', 'ios_gov_vercp',
             'ios_gov_matcp', 'ios_gov_vercf', 'ios_gov_matcf', 'ios_gov_gold',
             'gov_usa_vercp', 'gov_usa_matcp', 'gov_usa_vercf', 'gov_usa_matcf',
             'gov_usa_gold', 'usa_gov_vercp', 'usa_gov_matcp', 'usa_gov_vercf', 
             'usa_gov_matcf', 'usa_gov_gold', 'opp_reb_vercp', 'opp_reb_matcp', 
             'opp_reb_vercf', 'opp_reb_matcf', 'opp_reb_gold', 'reb_opp_vercp', 
             'reb_opp_matcp', 'reb_opp_vercf', 'reb_opp_matcf', 'reb_opp_gold', 
             'opp_opp_vercp', 'opp_opp_matcp', 'opp_opp_vercf', 'opp_opp_matcf',
             'opp_opp_gold', 'reb_reb_vercp', 'reb_reb_matcp', 'reb_reb_vercf', 
             'reb_reb_matcf', 'reb_reb_gold', 'opp_soc_vercp', 'opp_soc_matcp', 
             'opp_soc_vercf', 'opp_soc_matcf', 'opp_soc_gold', 'soc_opp_vercp', 
             'soc_opp_matcp', 'soc_opp_vercf', 'soc_opp_matcf', 'soc_opp_gold',
             'opp_ios_vercp', 'opp_ios_matcp', 'opp_ios_vercf', 'opp_ios_matcf',
             'opp_ios_gold', 'ios_opp_vercp', 'ios_opp_matcp', 'ios_opp_vercf', 
             'ios_opp_matcf', 'ios_opp_gold', 'opp_usa_vercp', 'opp_usa_matcp', 
             'opp_usa_vercf', 'opp_usa_matcf', 'opp_usa_gold', 'usa_opp_vercp', 
             'usa_opp_matcp', 'usa_opp_vercf', 'usa_opp_matcf', 'usa_opp_gold',
             'soc_ios_vercp', 'soc_ios_matcp', 'soc_ios_vercf', 'soc_ios_matcf',
             'soc_ios_gold', 'ios_soc_vercp', 'ios_soc_matcp', 'ios_soc_vercf', 
             'ios_soc_matcf', 'ios_soc_gold', 'soc_usa_vercp', 'soc_usa_matcp', 
             'soc_usa_vercf', 'soc_usa_matcf', 'soc_usa_gold', 'usa_soc_vercp', 
             'usa_soc_matcp', 'usa_soc_vercf', 'usa_soc_matcf', 'usa_soc_gold',
             'soc_soc_vercp', 'soc_soc_matcp', 'soc_soc_vercf', 'soc_soc_matcf',
             'soc_soc_gold')
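
For instance, to keep only the government dyads, you could subset the vector instead of deleting labels by hand (a quick sketch using base R's grepl):

variables = variables[grepl('gov', variables)] # keep labels mentioning 'gov'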

Let's say we want to get counts for a specific event, rather than the bigger quad categories. Using the 'arrest/detain' category, we would alter var_types to look something like

var_types = list(arrest = 173)

where 'arrest' is what we're going to call that variable type and 173 is the CAMEO code corresponding to arrest/detain. We would then edit the labels in variables accordingly (e.g. 'gov_opp_arrest', 'gov_soc_arrest', etc.).
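
If typing out all of those labels sounds tedious, you could also build the count labels programmatically from var_actors and the names of var_types, and add the gold labels the same way. A sketch using base R's expand.grid (you'd still want to drop any combinations you don't care about, like usa_usa dyads):

# every source_target_type combination
combos = expand.grid(src=var_actors, tgt=var_actors, type=names(var_types))
variables = paste(combos$src, combos$tgt, combos$type, sep='_')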

In the next section, I filter the data frame so that I only have within-country observations, as well as those where the US was either the source or the target. all_actors is a vector of countries that we are going to generate counts for. The lines after that create a few new variables to make aggregation easier and drop the columns we no longer need. If you want to aggregate by day rather than month, just generate a day column using lubridate's mday function (see the sketch after the code below).

    colnames(df) = c('date','iso1','cow1','agent1','iso2','cow2','agent2','cameo','goldstein','quad')
    # keep events within a single country, plus any event involving the US
    df = filter(df, as.character(df$iso1) == as.character(df$iso2) | df$iso1 == 'USA' | df$iso2 == 'USA')
    all_actors = as.vector(unique(df$iso1))
    all_actors = all_actors[all_actors != 'USA' & all_actors != '---' & nchar(all_actors) == 3]
    # full actor codes: ISO country code plus agent type, e.g. 'SYRGOV'
    df$Actor1Code = paste(df$iso1, df$agent1, sep='')
    df$Actor2Code = paste(df$iso2, df$agent2, sep='')
    df$date = as.Date(df$date)
    df$year = year(df$date)
    df$month = month(df$date)
    # keep only the columns needed for aggregation
    df = df[c('year','month','Actor1Code','Actor2Code','quad','goldstein')]
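
For daily aggregation, that would be one extra line here ('day' is just a column name I've picked, and it would also need to be added to the melt and dcast calls in the final section):

    df$day = mday(df$date) # day of the month, from lubridate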

The next section creates actor type groupings based on the agent codes. An observation's source actor, for example, is assigned the 'GOV' grouping if its agent code is 'GOV', 'MIL', 'JUD', or 'PTY'. Creating new groupings is straightforward (there's a sketch after the code below), but make sure that var_actors and variables in the first section of the function are edited accordingly so that variables based on new groupings are actually created.

    # generating actor types #
    df$actor1type = NA
    df$actor2type = NA
    check1 = substr(df$Actor1Code, 4, 6) # pulling second 3-letter code - agent type
    check2 = substr(df$Actor2Code, 4, 6) # pulling second 3-letter code - agent type
    df$actor1type[check1 == 'OPP'] = 'OPP'
    df$actor2type[check2 == 'OPP'] = 'OPP'
    df$actor1type[check1 %in% c('GOV','MIL','JUD','PTY')] = 'GOV'
    df$actor2type[check2 %in% c('GOV','MIL','JUD','PTY')] = 'GOV'
    df$actor1type[check1 %in% c('REB','INS','IMG')] = 'REB'
    df$actor2type[check2 %in% c('REB','INS','IMG')] = 'REB'
    df$actor1type[check1 %in% c('EDU','MED','HLH','CVL','BUS')] = 'SOC'
    df$actor2type[check2 %in% c('EDU','MED','HLH','CVL','BUS')] = 'SOC'
    df$actor1type[check1 %in% c('NGO','IGO')] = 'IOS'
    df$actor2type[check2 %in% c('NGO','IGO')] = 'IOS'
    # US actors are typed by country code, so this overrides the agent
    # groupings above and keeps them out of other countries' actor types
    df$actor1type[substr(df$Actor1Code, 1, 3) == 'USA'] = 'USA'
    df$actor2type[substr(df$Actor2Code, 1, 3) == 'USA'] = 'USA'
    df = na.omit(df) # drop events where either actor has no grouping
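
As an illustration, adding a hypothetical labor grouping built from the LAB agent code (assuming that code shows up in your data) would only take two more assignments, plus 'lab' in var_actors and the matching labels in variables:

    df$actor1type[check1 == 'LAB'] = 'LAB' # hypothetical labor grouping
    df$actor2type[check2 == 'LAB'] = 'LAB'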

The next section iterates over the potential combinations of actor types and variable types, checks whether each is listed in variables, and creates a new column for those that are. It then puts a 1 in that column if the observation corresponds to that variable. It also creates columns ending in gold that record the Goldstein score for that actor combination. If you're aggregating on something other than quad categories, you may want to comment out the second 'if' block.

    print('Creating count variables...')
    for(name1 in var_actors){
        for(name2 in var_actors){
            for(var_type in names(var_types)){
                var_name = paste(name1, name2, var_type, sep='_')
                var_gold = paste(name1, name2, 'gold', sep='_')
                if(is.element(var_name, variables)){
                    print(paste(var_name, 'is a variable', sep=' '))
                    check1 = df$actor1type
                    check2 = df$actor2type
                    check3 = df$quad
                    # indicator: 1 if the event matches this source, target, and type
                    df[var_name] = 0
                    df[[var_name]][check1 == toupper(name1) & check2 == toupper(name2) &
                        check3 == var_types[[var_type]]] = 1
                }
                if(is.element(var_gold, variables)){
                    print(paste(var_gold, 'is a variable', sep=' '))
                    # Goldstein score for events in this dyad, zero otherwise
                    df[var_gold] = df$goldstein
                    check1 = df$actor1type
                    check2 = df$actor2type
                    df[[var_gold]][check1 != toupper(name1) | check2 != toupper(name2)] = 0
                }
            }
        }
    }
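
These indicator columns become counts at the aggregation stage, where dcast sums the 1s within each country-month. A toy example (not part of the function, with a made-up country code) makes that concrete:

library(reshape2)
toy = data.frame(country='ABC', year=2014, month=1,
                 gov_opp_matcf=c(1, 0, 1)) # three events, two matches
tm = melt(toy, id.vars=c('country','month','year'))
dcast(tm, country+month+year~variable, value.var='value', fun.aggregate=sum)
#   country month year gov_opp_matcf
# 1     ABC     1 2014             2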

This final section aggregates the data by country, year, and month. If you decide that you want to aggregate by day as well, just make sure that your day variable is included in the id.vars of the melt call and on the left-hand side of the dcast formula. I had to use tryCatch in order to skip countries in all_actors that didn't experience any events; this is really only a problem if you're aggregating over a short time frame.

    final_df = data.frame()
    print('Creating groupings...')
    for(actor in all_actors){
        print(paste('Processing', actor, sep=' '))
        check1 = substr(df$Actor1Code, 1, 3) # country portion of the actor code
        check2 = substr(df$Actor2Code, 1, 3)
        # this country's internal events, plus its dyads with the US
        actor_dataset = filter(df, (check1==actor & check2==actor) |
            (check1==actor & check2=='USA') | (check1=='USA' & check2==actor))
        actor_dataset$Actor1Code = actor_dataset$Actor2Code = actor_dataset$quad =
            actor_dataset$goldstein = actor_dataset$actor1type = actor_dataset$actor2type = NULL
        tryCatch(
            {
                actor_dataset$country = actor
                # melt to long format, then sum every variable by country-month
                am = melt(actor_dataset, id.vars=c('country','month','year'), na.rm=TRUE)
                actor_dataset = dcast(am, country+month+year~variable, value.var='value',
                    fun.aggregate=sum)
                final_df = rbind(final_df, actor_dataset)
            },
            error=function(cond){
                message(paste('Skipped', actor, sep=' '))
                message(cond)
            },
            finally={
                message(paste('Processed', actor, sep=' '))
            }
        )
    }
    return(final_df)
}
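
With everything in place, a run on the data frame from earlier looks something like this:

result = aggregate_icews(icews.data)
head(result[c('country','year','month','gov_reb_matcf')]) # peek at one count column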

This can take a while to run on the full ICEWS dataset, so I'll probably set this up to run in parallel at some point in the near to distant future. If you happen to run into any errors using this, please let me know. There's only so much testing I can do myself.
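
In the meantime, one rough way to parallelize is to aggregate each yearly file separately and stack the results, since the output is just country-month rows. An untested sketch with the parallel package (mclapply won't fork on Windows, and mc.cores=4 is just a guess at your machine):

library(parallel)
yearly = mclapply(files, function(f) {
    aggregate_icews(read.table(f, header=FALSE, sep='\t')) # one year per worker
}, mc.cores=4)
result = do.call('rbind', yearly)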