Welcome to the blog. For this first post, I'm going to talk about a paper I have been working on with my classmate, John Beieler.
I'll start off with a little background. This research builds on a government project known as ICEWS (Intergrated Conflict Early Warning System), whose original goal was to develop predictive models for events of interest, such as rebellions, in countries in Southeast Asia. The major difference between our work and that of ICEWS is that we make use of open source event data. We also use statistical models commonly found in data mining literature.
The period of analysis covers 1997 to 2009 and is aggregated at a monthly level Our event of interest in this case is rebellion. We use a dummy variable, coded as 1 if a rebellion either began or was ongoing that month, and 0 if peace. This variable is taken from the original ICEWS data. Our goal is to predict 6 months into the future.
We use data from GDELT (Global Database of Events, Location, and Tone), for our independent variables. Each independent variable is a count of the number of a type of event that occurred each month. For example, the variable gov_opp_matcf represents the number of conflictual events that occurred between the government of that country and an opposition group that took place that month. There are 70 categories total of these actor-actor-event combinations. The data for some of these are fairly sparse, so many categories have a lot of zeroes.
For the analysis, we use 4 different models: logistic regression, support vector machine, random forest, and adaptive boosting. We use the same 75/25 train-test data split for each analysis. All hyper-parameters are tuned using 5-fold grid-search cross validation. In order to correct for both the sparsity and range of the data, we scale it so that the data have a mean of 0 and standard deviation of 1.
Out-of-sample test results for each model are displayed below. The first value is the precision score, the second is the recall score, and the third the F1 score.
- AdaBoost - 67% - 71% - 69%
- Logistic Regression - 63% - 81% - 71%
- Random Forest - 81% - 74% - 77%
- SVM - 66% - 74% - 70%
We avoid using base accuracy scores due to the nature of what we are trying to predict. Rebellions are rare events, so simply predicting zeroes for all observations would still yield very high accuracy. Random forest gets the best scores for precision and F1 scores, and logistic regression has the best recall. So, which is the best for us? In our case, we want to minimize false negatives and maximize our true positives. While we don't want false positives, we would prefer those to true negatives. The idea is that we would prefer to say a rebellion will happen and be wrong than to predict no rebellion and be unprepared. Logistic regression seems to be marginally better than random forest for use.
Below are ROC plots for each of the models. These plot the fraction of true positives vs the fraction of false positives as the discrimination threshold is varied.
The only one that doesn't perform too well is adaptive boosting. Random forest and SVM both have slightly higher area under the curve than logistic regression. However, we are less concerned with false positives than false negatives, which ROC curves don't tell us much about.
This all still needs some work. The main issue at the moment is that we are predicting every month that a rebellion is ongoing, rather than only rebellion onset. We are currently reformatting the data to accomodate this approach. Another problem we need to deal with is that we are predicting the outcome for each month with just the data from (e.g. predicting December with just data from July). We may get better accuracy by weighting data from the months leading up to the 6 month cutoff.
Overall, it's a good start. We're getting good accuracy in our initial results. More importantly, this has all been done with open source data. This also provides a good assessment of how new (at least to political science) data mining methods compare to the workhorse logit model.
Those who want to know more about ICEWS should see O'Brien 2010 "Crisis early warning and decision support: Contemporary approaches and thoughts on future research."
The working paper can be found here
The ICEWS data is still "official use only" unfortunately, so I can only provides the data we used for the independent variables.
Feel free to email me if you have any questions or comments.
2/6/14: Due to the ongoing dispute over GDELT, I've taken down the data on the independent variables.