Ben Fisher

Jul 08, 2014

The Problem with the ROC Curve

The Receiver Operating Characteristic (ROC) curve is one of the most ubiquitous measures of accuracy in the machine learning literature. Originally developed for signal detection during World War II, the curve provides a graphical representation of a predictive model's accuracy by showing the true positive and false positive rates for a model at various predictive thresholds. These thresholds, also referred to as cutoff points, divide observations based on their predicted probabilities. For example, a threshold of .5 means that the model classifies observations with predicted probabilities greater than or equal to .5 as positives, and those with predicted probabilities less than .5 as negatives. Because the ideal cutoff point varies with the problem at hand, the ROC curve provides a means of determining where the best cutoff point lies.
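
As a concrete (if toy) illustration of how a single threshold turns predicted probabilities into the true and false positive rates plotted on the ROC axes, here is a small Python sketch; the labels and probabilities are made up for the example:

```python
import numpy as np

# Toy data: true labels and a model's predicted probabilities (made up for illustration)
y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.1, 0.8, 0.45, 0.05])

threshold = 0.5
y_pred = (y_prob >= threshold).astype(int)  # classify as positive at or above the cutoff

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))

tpr = tp / (tp + fn)  # true positive rate (sensitivity)
fpr = fp / (fp + tn)  # false positive rate (1 - specificity)
print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Sweeping the threshold from 1 down to 0 and plotting each (FPR, TPR) pair traces out the ROC curve.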

The main metric derived from the ROC curve is the area under the curve (AUC) score, which, as the name suggests, is the area of the graph contained under the ROC curve. The ideal AUC is 1.0, which means the model correctly classifies all of the positive observations without any false positives. An AUC of .5 indicates that the model is no better than a coin flip. The AUC is often used as a means of model comparison. The intuition is that models with a higher AUC are doing a better job of assigning higher predicted probabilities to positive observations and lower predicted probabilities to negative observations.
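
If scikit-learn is available, the curve and its AUC can be computed directly from the predicted probabilities. A minimal sketch, reusing the same made-up data as above:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up labels and predicted probabilities, as in the sketch above
y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.1, 0.8, 0.45, 0.05])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_prob)
print(f"AUC = {auc:.3f}")  # 1.0 = perfect ranking, 0.5 = no better than a coin flip
```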

[Figure: ROC curves comparing models for predicting Militarized Interstate Incidents]

The figure above shows ROC curves comparing models for predicting Militarized Interstate Incidents [1]. With AUC scores all above .9, it looks like the models are all very good. This is where ROC curves become misleading. The axes only measure the rates of positive and negative observations, not the actual numbers. This is fine for a balanced dataset, but it is problematic for unbalanced datasets like the one here. In this case, there are about 300 positive observations and over 100,000 negative observations. Suddenly, achieving a true positive rate of .8 with a false positive rate of .2 doesn't look so good. A false positive rate of just .01 means over 1,000 false positives. This is an unacceptable number if we are only getting 100 or so true positives in exchange.
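
To make the scale of the problem concrete, a quick back-of-the-envelope calculation translates these rates into raw counts. The class sizes are the approximate ones quoted above, and the TPR of 0.33 is just an illustrative value that yields roughly 100 true positives:

```python
# Approximate class sizes from the example above: ~300 positives, ~100,000 negatives.
n_pos, n_neg = 300, 100_000

def counts(tpr, fpr):
    """Translate rates on the ROC axes into raw counts for this dataset."""
    return tpr * n_pos, fpr * n_neg

tp, fp = counts(0.8, 0.2)    # the "good-looking" ROC point from the text
print(f"TPR=0.80, FPR=0.20 -> ~{tp:.0f} true positives, ~{fp:.0f} false positives")

tp, fp = counts(0.33, 0.01)  # even a tiny FPR still yields over 1,000 false alarms
print(f"TPR=0.33, FPR=0.01 -> ~{tp:.0f} true positives, ~{fp:.0f} false positives")
```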

This issue also renders the ROC curve's usefulness for model comparison irrelevant for unbalanced data. Even if one model has a better AUC than another, it does not mean much if that is due to better performance when the false positive rate is greater than .1. The number of true positives is already overwhelmed by the number of false positives. Performance past this point is irrelevant. Really, the best model in this case is the one that achieves the highest true positive rate while keeping the false positive rate as close to 0 as possible.
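
One way to make "performance past this point is irrelevant" operational is to score only the low-false-positive portion of the curve. For instance, recent versions of scikit-learn's roc_auc_score accept a max_fpr argument that computes a standardized partial AUC; a rough sketch on simulated unbalanced data (the class sizes and score distributions are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Simulated unbalanced data: 300 positives, 100,000 negatives (as in the example above)
y_true = np.concatenate([np.ones(300), np.zeros(100_000)])
y_score = np.concatenate([rng.normal(1.5, 1.0, 300), rng.normal(0.0, 1.0, 100_000)])

full_auc = roc_auc_score(y_true, y_score)
partial_auc = roc_auc_score(y_true, y_score, max_fpr=0.01)  # only the low-FPR region counts
print(f"Full AUC:             {full_auc:.3f}")
print(f"Partial AUC (<= .01): {partial_auc:.3f}")
```

The partial AUC rewards a model only for what it does in the region of the curve that actually matters for a rare-events problem.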

Conflict researchers, who work almost exclusively with unbalanced data, should either avoid using ROC curves and the associated AUC scores as metrics, or at least report the total number of positive and negative cases in the sample in a note attached to the figure. Fortunately, there are alternatives. I'll discuss one of these, the precision-recall curve, in another post.


[1] I won't get into what these incidents are here, but those who are curious can check out the [Correlates of War Project page](