Ben Fisher

Apr 28, 2015

Aggregating ICEWS Data in R

This is a walkthrough of an R function I've written (available here) that aggregates the recently (finally!) released ICEWS data. aggregate_icews converts the data into country-month observations with columns counting events internal to a country by type and actor combination. For example, the gov_reb_matcf variable is a count of material conflict events where the government was the source and a rebel group was the target. The event types are based on the standard quad categories: verbal cooperation, material cooperation, verbal conflict, and material conflict. The actor types are based on groupings of agent types from the text_to_CAMEO program. The groupings are:

  • GOV - government actors, grouped from GOV, MIL, JUD, PTY
  • REB - rebel actors, grouped from REB, INS, IMG
  • OPP - political opposition, from just OPP
  • SOC - civil society, grouped from EDU, MED, HLH, CVL, BUS
  • IOS - international organizations, grouped from IGO, NGO
  • USA - United States

Now, this is just how I've decided to group them. There are obviously a lot of ways this could be done, and adding or changing a grouping only takes adding or deleting a line or two of code in the function.

Since the format that the ICEWS data comes in is not particularly easy to use, you'll need to first run Phil Schrodt's text_to_CAMEO program. If you don't have any experience with Python, you can run the program from R using the rPython package. Just make sure that the text_to_CAMEO and ICEWS files are all in your working directory.

library(rPython)
setwd('directory/with/files/')
python.load('text_to_CAMEO.py') # runs the python program, it may take a minute or two

This will generate a tab-delimited text file for each year of ICEWS data. The aggregation function takes a data frame as its main argument, so you'll want to read the text files into a data frame.

files = list.files(pattern='reduced.ICEWS.events.*.txt')
icews.data = do.call('rbind', lapply(files, function(x) read.table(x, header=FALSE,
    sep='\t')))

Now that we have the data ready to aggregate, I'm going to walk through the aggregate_icews function so that you can get an idea of how it works. I expect most people will end up modifying the function in some way for different aggregations.

In the first section, I create a vector of actor groupings, var_actors, a list of the quad categories and their corresponding numerical codes, var_types, and a vector of labels for the variables I'm going to create, variables. It's important to note that the function will only create a variable if it is listed in the variables vector. So, if you wanted to only get counts for dyads involving a government actor, then you would remove any label in variables that does not include 'gov'.

aggregate_icews = function(df){
    require(lubridate)
    require(dplyr)
    require(reshape2)

    var_actors = c('gov','opp','soc','ios','usa','reb')
    var_types = list(vercp = 1, matcp = 2, vercf = 3, matcf = 4)
    variables = c('gov_gov_vercp', 'gov_gov_matcp', 'gov_gov_vercf', 'gov_gov_matcf',
             'gov_gov_gold', 'gov_opp_vercp', 'gov_opp_matcp', 'gov_opp_vercf',
             'gov_opp_matcf', 'opp_gov_vercp', 'opp_gov_matcp', 'opp_gov_vercf',
             'opp_gov_matcf', 'opp_gov_gold', 'gov_reb_vercp', 'gov_reb_matcp',
             'gov_reb_vercf', 'gov_reb_matcf', 'gov_reb_gold', 'reb_gov_vercp',
             'reb_gov_matcp', 'reb_gov_vercf', 'reb_gov_matcf', 'reb_gov_gold',
             'gov_soc_vercp', 'gov_soc_matcp', 'gov_soc_vercf', 'gov_soc_matcf',
             'gov_soc_gold', 'soc_gov_vercp', 'soc_gov_matcp', 'soc_gov_vercf',
             'soc_gov_matcf', 'soc_gov_gold', 'gov_ios_vercp', 'gov_ios_matcp', 
             'gov_ios_vercf', 'gov_ios_matcf', 'gov_ios_gold', 'ios_gov_vercp',
             'ios_gov_matcp', 'ios_gov_vercf', 'ios_gov_matcf', 'ios_gov_gold',
             'gov_usa_vercp', 'gov_usa_matcp', 'gov_usa_vercf', 'gov_usa_matcf',
             'gov_usa_gold', 'usa_gov_vercp', 'usa_gov_matcp', 'usa_gov_vercf', 
             'usa_gov_matcf', 'usa_gov_gold', 'opp_reb_vercp', 'opp_reb_matcp', 
             'opp_reb_vercf', 'opp_reb_matcf', 'opp_reb_gold', 'reb_opp_vercp', 
             'reb_opp_matcp', 'reb_opp_vercf', 'reb_opp_matcf', 'reb_opp_gold', 
             'opp_opp_vercp', 'opp_opp_matcp', 'opp_opp_vercf', 'opp_opp_matcf',
             'opp_opp_gold', 'reb_reb_vercp', 'reb_reb_matcp', 'reb_reb_vercf', 
             'reb_reb_matcf', 'reb_reb_gold', 'opp_soc_vercp', 'opp_soc_matcp', 
             'opp_soc_vercf', 'opp_soc_matcf', 'opp_soc_gold', 'soc_opp_vercp', 
             'soc_opp_matcp', 'soc_opp_vercf', 'soc_opp_matcf', 'soc_opp_gold',
             'opp_ios_vercp', 'opp_ios_matcp', 'opp_ios_vercf', 'opp_ios_matcf',
             'opp_ios_gold', 'ios_opp_vercp', 'ios_opp_matcp', 'ios_opp_vercf', 
             'ios_opp_matcf', 'ios_opp_gold', 'opp_usa_vercp', 'opp_usa_matcp', 
             'opp_usa_vercf', 'opp_usa_matcf', 'opp_usa_gold', 'usa_opp_vercp', 
             'usa_opp_matcp', 'usa_opp_vercf', 'usa_opp_matcf', 'usa_opp_gold',
             'soc_ios_vercp', 'soc_ios_matcp', 'soc_ios_vercf', 'soc_ios_matcf',
             'soc_ios_gold', 'ios_soc_vercp', 'ios_soc_matcp', 'ios_soc_vercf', 
             'ios_soc_matcf', 'ios_soc_gold', 'soc_usa_vercp', 'soc_usa_matcp', 
             'soc_usa_vercf', 'soc_usa_matcf', 'soc_usa_gold', 'usa_soc_vercp',
             'usa_soc_matcp', 'usa_soc_vercf', 'usa_soc_matcf', 'usa_soc_gold',
             'soc_soc_vercp', 'soc_soc_matcp', 'soc_soc_vercf', 'soc_soc_matcf',
             'soc_soc_gold')
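
As mentioned above, if you only wanted the dyads involving a government actor, one quick way (just a sketch) is to filter the labels rather than retyping them:

variables = variables[grepl('gov', variables)] # keep only labels containing 'gov'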

Let's say we want to get counts for a specific event, rather than the bigger quad categories. Using the 'arrest/detain' category, we would alter var_types to look something like

var_types = list(arrest = 173)

where 'arrest' is what we're going to call that variable type and 173 is the CAMEO code corresponding to arrest/detain. We would then edit the labels in variables accordingly (e.g. 'gov_opp_arrest','gov_soc_arrest', etc.).
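
Rather than typing out every one of those labels by hand, one option (a sketch using the actor groupings above) is to generate them programmatically:

combos = expand.grid(c('gov','opp','soc','ios','usa','reb'),
                     c('gov','opp','soc','ios','usa','reb'))
variables = paste(combos$Var1, combos$Var2, 'arrest', sep='_') # 'gov_opp_arrest', etc.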

In the next section, I filter the data frame so that I only have within-country observations, plus those where the US was either the source or the target. all_actors is a vector of countries that we are going to generate counts for. The remaining lines create a few new variables to make aggregation easier and drop the columns we no longer need. If you want to aggregate by day rather than month, just generate a day column using lubridate's mday function (see the sketch after the code below).

    colnames(df) = c('date','iso1','cow1','agent1','iso2','cow2','agent2','cameo','goldstein','quad')
    df = filter(df, as.character(df$iso1) == as.character(df$iso2) | df$iso1 == 'USA' | df$iso2 == 'USA')
    all_actors = as.vector(unique(df$iso1))
    all_actors = all_actors[all_actors != 'USA' & all_actors != '---' & nchar(all_actors) == 3]
    df$Actor1Code = paste(df$iso1, df$agent1, sep='')
    df$Actor2Code = paste(df$iso2, df$agent2, sep='')
    df$date = as.Date(df$date)
    df$year = year(df$date)
    df$month = month(df$date)
    df = df[c(13,14,11,12,10,9)] # keep year, month, actor codes, quad, and goldstein
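
If you do want daily counts, a minimal sketch of the change (assuming the day column is created alongside year and month, and the column selection above is adjusted to keep it) would be:

    df$day = mday(df$date)           # lubridate's day-of-month
    df = df[c(13,14,15,11,12,10,9)]  # year, month, day, actor codes, quad, goldstein

You would then also need to carry 'day' through the melt and dcast calls in the final section.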

The next section creates actor type groupings based on the agent codes. The GOV lines, for example, assign the 'GOV' grouping to an observation if the agent code is 'GOV', 'MIL', 'JUD', or 'PTY'. Creating new groupings is straightforward (see the sketch after the code below), but make sure that var_actors and variables in the first section of the function are edited accordingly so that variables based on new groupings are actually created.

    # generating actor types #
    df$actor1type = NA
    df$actor2type = NA
    check1 = substr(df$Actor1Code, 4, 6) # pulling second 3 letter code - agent type
    check2 = substr(df$Actor2Code, 4, 6) # pulling second 3 letter code - agent type
    df$actor1type[check1 == 'OPP'] = 'OPP'
    df$actor2type[check2 == 'OPP'] = 'OPP'
    df$actor1type[check1== 'GOV' | check1=='MIL' | check1=='JUD' | 
    check1=='PTY'] = 'GOV'
    df$actor2type[check2== 'GOV' | check2=='MIL' | check2=='JUD' |
    check2=='PTY'] = 'GOV'
    df$actor1type[check1== 'REB' | check1=='INS'] = 'REB'
    df$actor2type[check2== 'REB' | check2=='INS'] = 'REB'
    df$actor1type[check1== 'EDU' | check1=='MED' | check1=='HLH' | 
    check1=='CVL' | check1=='BUS'] = 'SOC'
    df$actor2type[check2== 'EDU' | check2=='MED' | check2=='HLH' |
    check2=='CVL' | check2=='BUS'] = 'SOC'
    df$actor1type[check1=='NGO' | check1=='IGO'] = 'IOS'
    df$actor2type[check2=='NGO' | check2=='IGO'] = 'IOS'
    df$actor1type[check1=='USA'] = 'USA'
    df$actor2type[check2=='USA'] = 'USA'
    df = na.omit(df)
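
As a sketch of what adding a grouping looks like: to treat the military as its own actor type rather than folding it into GOV, you might drop 'MIL' from the GOV lines above and add

    df$actor1type[check1=='MIL'] = 'MIL'
    df$actor2type[check2=='MIL'] = 'MIL'

and then add 'mil' to var_actors and the corresponding 'mil_*' and '*_mil' labels to variables so the new columns are actually created.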

The next section iterates over the potential combinations of actor types and variable types, checks whether each combination is listed in variables, and creates a new column if it is. It then puts a 1 in that column for each observation that corresponds to that variable. It also creates columns ending in gold that record the Goldstein score for that actor combination. If you're aggregating on something other than quad categories, you may want to comment out the second 'if' section.

    print('Creating count variables...')
    for(name1 in var_actors){
        for(name2 in var_actors){
            for(var_type in names(var_types)){
                var_name = paste(name1, name2, var_type, sep='_')
                var_gold = paste(name1, name2, 'gold', sep='_')
                if(is.element(var_name, variables)){
                    print(paste(var_name, 'is a variable', sep=' '))
                    check1 = df$actor1type
                    check2 = df$actor2type
                    check3 = df$quad
                    df[var_name] = 0
                    df[[var_name]][check1 == toupper(name1) & check2 == toupper(name2) & check3
                    == var_types[[var_type]]] = 1
                }
                if(is.element(var_gold, variables)){
                    print(paste(var_gold, 'is a variable', sep=' '))
                    df[var_gold] = 0
                    df[var_gold] = df$goldstein
                    check1 = df$actor1type
                    check2 = df$actor2type
                    df[[var_gold]][check1 != toupper(name1) | check2 != toupper(name2)] = 0
                }
            }
        }
    }
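
One caveat if you swap CAMEO event codes into var_types (like the arrest = 173 example above): the loop compares against the quad column, which only takes values 1 through 4, so you would presumably point the check at the cameo column instead, e.g.

    check3 = df$cameo
    df[[var_name]][check1 == toupper(name1) & check2 == toupper(name2) &
        check3 == var_types[[var_type]]] = 1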

This final section aggregates the data by country, year, and month. If you decide that you want to aggregate by day as well, just make sure that your day variable is included in the id.vars of the melt call and the formula of the dcast call. I had to use tryCatch in order to skip countries in all_actors that didn't experience any events. This is really only a problem if you're aggregating over a short time frame.

    final_df = data.frame()
    print('Creating groupings...')
    for(actor in all_actors){
        print(paste('Processing', actor, sep=' '))
        check1 = substr(df$Actor1Code, 1, 3)
        check2 = substr(df$Actor2Code, 1, 3)
        actor_dataset = filter(df, (check1==actor & check2==actor) | (check1==actor & check2=='USA') | (check1=='USA' & check2==actor))
        actor_dataset$Actor1Code = actor_dataset$Actor2Code = actor_dataset$quad = 
        actor_dataset$goldstein = actor_dataset$actor1type = actor_dataset$actor2type = NULL
        tryCatch(
            {
                actor_dataset$country = actor
                am = melt(actor_dataset, id.vars=c('country', 'month','year'), na.rm=TRUE)
                actor_dataset = dcast(am, country+month+year~variable, value.var='value', 
                fun.aggregate=sum)
                final_df = rbind(final_df, actor_dataset)
                },
                error=function(cond){
                    message(paste('Skipped',actor, sep=' '))
                    message(cond)
                },
                finally = {
                    message(paste('Processed', actor, sep=' '))
                }
            )
    }
return(final_df)
}
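
Putting the pieces together, a typical call (assuming the icews.data frame built earlier) looks something like:

icews.cm = aggregate_icews(icews.data)
write.csv(icews.cm, 'icews_country_month.csv', row.names=FALSE)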

This can take a while to run on the full ICEWS dataset, so I'll probably set this up to run in parallel at some point in the near to distant future. If you happen to run into any errors using this, please let me know. There's only so much testing I can do myself.

Feb 17, 2015

Counting the Dead

As Jay Ulfelder pointed out in a recent blog post, getting data on political violence, especially in undeveloped countries, is hard. Violent events can go unreported, either due to government suppression or a lack of reporters in the region. The information that does get through is rarely going to be perfect. As a result, determining specific numbers for dead and wounded turns into a morbid game of educated guessing.

To illustrate Jay's point that numbers are difficult to get at, I've done a rough comparison of two datasets that record incidents of violence against civilians. The motivation behind this effort is to show that even when researchers are able to collect data, the results are far from consistent. The two datasets are the Worldwide Atrocities Dataset, and the Armed Conflict Location and Event Data Project (ACLED). The former records violent acts against noncombatants that result in at least 5 fatalities - so anything from a suicide bomber to government troops firing on protestors - for all countries. The latter dataset examines armed conflict in general, with a focus on African countries. For the purposes of this blog post, I only look at ACLED events that involved violence against civilians.

Now, to get a few of the gory preprocessing details out of the way: this is looking at African countries from 1997-2012, I drop ACLED observations with fewer than 5 fatalities to better match the Atrocities coding rules, and I drop a couple of extreme outliers from each dataset.[1]
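
For reference, the ACLED side of that preprocessing looks roughly like the sketch below. The data frame and column names ('acled', 'event_type', 'year', 'fatalities') are stand-ins rather than ACLED's actual field names; the real code is linked at the end of the post.

library(dplyr)
acled.sub = filter(acled,
                   event_type == 'Violence against civilians', # civilian-targeting events only
                   year >= 1997, year <= 2012,
                   fatalities >= 5,      # match the Atrocities coding threshold
                   fatalities < 25000)   # drop the outlier noted in [1]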

The two datasets record roughly the same number of observations - 1,009 in the Atrocities dataset and 1,037 in ACLED - but the total number of deaths recorded varies substantially. The Atrocities data for this sample records 83,913 fatalities, while ACLED records 78,164. The figure below plots the total number of deaths per month recorded by each dataset. It's easy to see that the two datasets give very different death totals in some cases, particularly in the early 2000s. The statistical correlation between the two series is about .25.

[Figure: total recorded deaths per month, Atrocities dataset vs. ACLED]

As another rough check, I did a comparison of deaths per month for just the Democratic Republic of the Congo. The DRC has a long history of civil conflict that has resulted in huge numbers of civilian casualties. It's a great example of a situation where we know violence against civilians is a major problem, but determining the scale is extremely difficult. As the figure below shows, the two datasets appear to capture the same major spikes in violence, but give very different numbers of dead in several cases.

[Figure: recorded deaths per month in the Democratic Republic of the Congo, Atrocities dataset vs. ACLED]

Unfortunately for researchers, I don't see this problem ever really getting better. I think there will be some improvement in reporting accuracy as mobile phone use increases, but the numbers are remarkably inconsistent as recently as 2010. However, this doesn't resolve the fact that our historical data are all over the place. The 'ground truth' of political violence remains murky at best.

Notes

[1] The Atrocities dataset has a campaign observation for Sudan where 50,000 people were killed, and ACLED records a major massacre in the Congo where 25,000 were killed. Neither of these observations was matched in the other dataset. I discarded them to keep the figures readable.

Data, figures, and replication code for this post can be accessed here.

Jul 08, 2014

The Problem with the ROC Curve

The Receiver Operating Characteristic (ROC) curve is one of the most ubiquitous measures of accuracy in the machine learning literature. Originally developed for signal detection during World War II, the curve provides a graphical representation of a predictive model's accuracy by showing the true positive and false positive rates for a model at various predictive thresholds. These thresholds, also referred to as cutoff points, divide observations based on their predicted probabilities. For example, a threshold of .5 means that the model classifies observations with predicted probabilities greater than or equal to .5 as positives, and those with predicted probabilities less than .5 as negatives. Because the ideal cutoff point varies with the problem at hand, the ROC curve provides a means of determining where the best cutoff lies.

The main metric derived from the ROC curve is the area under the curve (AUC) score, which, as the name suggests, is the area of the graph contained under the ROC curve. The ideal AUC is 1.0, which means the model correctly classifies all of the positive observations without any false positives. An AUC of .5 indicates that the model is no better than a coin flip. The AUC is often used as a means of model comparison. The intuition is that models with a higher AUC are doing a better job of assigning higher predicted probabilities to positive observations and lower predicted probabilities to negative observations.
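
To make the mechanics concrete, here is a small self-contained sketch (using simulated predictions, not any model from this post) of how an ROC curve is traced and the AUC computed by sweeping thresholds:

set.seed(42)
labels = rbinom(1000, 1, 0.3)                                      # simulated outcomes
probs = ifelse(labels == 1, rbeta(1000, 4, 2), rbeta(1000, 2, 4))  # simulated predicted probabilities
thresholds = seq(1, 0, by=-0.01)
tpr = sapply(thresholds, function(t) sum(probs >= t & labels == 1) / sum(labels == 1))
fpr = sapply(thresholds, function(t) sum(probs >= t & labels == 0) / sum(labels == 0))
auc = sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)         # trapezoidal rule
plot(fpr, tpr, type='l', xlab='False positive rate', ylab='True positive rate')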

[Figure: ROC curves for models predicting Militarized Interstate Incidents]

The figure above shows ROC curves comparing models for predicting Militarized Interstate Incidents [1]. With AUC scores all above .9, it looks like the models are all very good. This is where ROC curves become misleading. The axes only measure the rates of positive and negative observations, not the actual numbers. This is fine for a balanced dataset, but it is problematic for imbalanced datasets like the one here. In this case, there are about 300 positive observations and over 100,000 negative observations. Suddenly, achieving a true positive rate of .8 with a false positive rate of .2 doesn't look so good. A false positive rate of just .01 means 1,000 false positives. This is an unacceptable number if we are only getting 100 or so true positives in exchange.
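
The arithmetic is worth spelling out. Using the rough class counts above (approximations, not the exact numbers behind the figure), even a generous-looking point on the ROC curve implies far more false alarms than hits:

positives = 300
negatives = 100000
tpr = 0.8    # looks strong on an ROC plot
fpr = 0.01   # looks tiny on an ROC plot
true.positives = tpr * positives    # 240
false.positives = fpr * negatives   # 1000
precision = true.positives / (true.positives + false.positives)  # about 0.19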

This issue also undermines the ROC curve's usefulness for model comparison on unbalanced data. Even if one model has a better AUC than another, it does not mean much if the advantage comes from better performance where the false positive rate is greater than .1. The number of true positives is already overwhelmed by the number of false positives, so performance past this point is irrelevant. Really, the best model in this case would be the one that achieves the highest true positive rate while keeping the false positive rate close to 0.

Conflict researchers, who work almost exclusively with unbalanced data, should either avoid using ROC curves and the associated AUC scores as metrics, or at least include the total number of positive and negative cases in the sample as a note attached to the figure. Fortunately, there are alternatives. I'll discuss one of these, the precision-recall curve, in another post.

Notes

[1] I won't get into what counts as an incident here, but those who are curious can check out the [Correlates of War Project page](http://correlatesofwar.org).

Jan 10, 2014

Forecasting Rebellion

Welcome to the blog. For this first post, I'm going to talk about a paper I have been working on with my classmate, John Beieler.

I'll start off with a little background. This research builds on a government project known as ICEWS (Integrated Crisis Early Warning System), whose original goal was to develop predictive models for events of interest, such as rebellions, in countries in Southeast Asia. The major difference between our work and that of ICEWS is that we make use of open source event data. We also use statistical models commonly found in the data mining literature.

Data

The period of analysis covers 1997 to 2009 and is aggregated at the monthly level. Our event of interest in this case is rebellion. We use a dummy variable, coded as 1 if a rebellion either began or was ongoing that month, and 0 otherwise. This variable is taken from the original ICEWS data. Our goal is to predict 6 months into the future.

We use data from GDELT (Global Database of Events, Language, and Tone) for our independent variables. Each independent variable is a monthly count of a particular type of event between a particular pair of actors. For example, the variable gov_opp_matcf represents the number of material conflict events between the government of that country and an opposition group in that month. There are 70 of these actor-actor-event categories in total. The data for some of these are fairly sparse, so many categories have a lot of zeroes.

Analysis

For the analysis, we use 4 different models: logistic regression, support vector machine, random forest, and adaptive boosting. We use the same 75/25 train-test data split for each analysis. All hyper-parameters are tuned using 5-fold grid-search cross validation. In order to correct for both the sparsity and range of the data, we scale it so that the data have a mean of 0 and standard deviation of 1.
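
As a rough illustration of that setup (simulated data and a baseline logit only, not our actual models or tuning), the scaling and 75/25 split look something like:

set.seed(1)
n = 2000
dat = data.frame(gov_opp_matcf = rpois(n, 2), opp_gov_matcf = rpois(n, 1))  # simulated event counts
dat$rebellion = rbinom(n, 1, plogis(-3 + 0.5*dat$gov_opp_matcf))            # simulated outcome
dat[1:2] = scale(dat[1:2])                         # rescale counts to mean 0, sd 1
train.idx = sample(n, floor(0.75*n))               # 75/25 train-test split
logit = glm(rebellion ~ ., data=dat[train.idx,], family=binomial)
preds = predict(logit, newdata=dat[-train.idx,], type='response') >= 0.5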

Out-of-sample test results for each model are displayed below. The first value is the precision score, the second is the recall score, and the third is the F1 score (a quick check of how these fit together follows the list).

  • AdaBoost - 67% - 71% - 69%
  • Logistic Regression - 63% - 81% - 71%
  • Random Forest - 81% - 74% - 77%
  • SVM - 66% - 74% - 70%
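
As that quick check, the F1 score is just the harmonic mean of precision and recall; for example, for the random forest numbers above:

precision = 0.81
recall = 0.74
f1 = 2 * precision * recall / (precision + recall)  # roughly 0.77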

We avoid using base accuracy scores due to the nature of what we are trying to predict. Rebellions are rare events, so simply predicting zeroes for all observations would still yield very high accuracy. Random forest has the best precision and F1 scores, and logistic regression has the best recall. So, which is the best for us? In our case, we want to minimize false negatives and maximize true positives. While we don't want false positives, we would prefer them to false negatives. The idea is that we would prefer to say a rebellion will happen and be wrong than to predict no rebellion and be unprepared. Logistic regression seems to be marginally better than random forest for our purposes.

Below are ROC plots for each of the models. These plot the true positive rate against the false positive rate as the discrimination threshold is varied.

[Figures: ROC curves for the AdaBoost, logistic regression, random forest, and SVM models]

The only one that doesn't perform too well is adaptive boosting. Random forest and SVM both have slightly higher area under the curve than logistic regression. However, we are less concerned with false positives than false negatives, which ROC curves don't tell us much about.

Final Thoughts

This all still needs some work. The main issue at the moment is that we are predicting every month that a rebellion is ongoing, rather than only rebellion onset. We are currently reformatting the data to accommodate this approach. Another problem we need to deal with is that we are predicting the outcome for each month with data from just a single month (e.g. predicting December with just data from July). We may get better accuracy by weighting data from the months leading up to the 6-month cutoff.

Overall, it's a good start. We're getting good accuracy in our initial results. More importantly, this has all been done with open source data. This also provides a good assessment of how new (at least to political science) data mining methods compare to the workhorse logit model.

Notes

Those who want to know more about ICEWS should see O'Brien 2010 "Crisis early warning and decision support: Contemporary approaches and thoughts on future research."

The working paper can be found here.

The ICEWS data is still "official use only", unfortunately, so I can only provide the data we used for the independent variables.

Feel free to email me if you have any questions or comments.

2/6/14: Due to the ongoing dispute over GDELT, I've taken down the data on the independent variables.