In my last post, I dug a little bit into the difference between accuracy and predictive value, and how those get confused when using “accuracy” in conversation. When people ask “how accurate is it?” they aren’t usually asking in the data scientist sense. They aren’t even asking the question I answered last time, “When the light turns red, what are the odds it’s a real problem?” They’re asking: “How useful is this thing in directing my attention where it needs to be?”
Today I’ll give a tour of a good tool for answering that and explain how we use it to tune results to clients’ specific needs. In short: we’ll bridge the gap between technical accuracy and usefulness.
The ROC Curve
Let’s revisit that student risk model. The usefulness of the model depends on how you can respond to it. We need a way to discuss model performance that allows trade-offs between positive and negative factors. For that, we look at a nifty tool for describing binary classifiers: the ROC curve.
Here’s the ROC curve for one of our student risk models. It’s the zero-week model we use when we have very little data on a student’s history.
I’ve labeled some important features:
- (A) Our model’s performance from the last post
- (B) The point of perfect prediction. It’s not a real place. It’s a utopia where only photographers of fast-food menu items can go. In the real world you only end up here with a trivially simple model or egregious over-fitting.
- (C) The zone of “Oops, I hooked it up backwards.” A point here does worse than random guessing: if you’re consistently here, you have a working model, but its sense of positive and negative is reversed.
The blue curve shows that if you’re willing to tolerate a very high false positive rate, you can always classify all of the true positives correctly. You’ll just capture a lot of garbage too. If you can’t tolerate very many false positives, you’ll pass over a lot of real positives. You can pick any point along this curve.
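If it helps to see where a curve like this comes from, here is a minimal sketch in plain Python. It assumes the model gives each student a risk score and that we know afterward who really had a problem. The function name `roc_points` and the eight toy scores are made up for illustration; they are not our production code or real student data.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold.

    scores: model risk scores (higher means more likely a real problem)
    labels: 1 for a real problem, 0 otherwise
    A case is flagged when its score is at or above the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Start with a threshold above every score: nothing is flagged.
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical toy data, not real model output.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
curve = roc_points(scores, labels)
```

Each pair in `curve` is one possible operating point. Lowering the threshold only ever moves you up and to the right, which is why catching every true positive eventually costs you every false positive too.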
Now that we know how to read it, how do we use it?
One Option: Tune for Response Capacity
Only have time to deal with 100 interventions? Set your alert threshold so the total number of positives (true and false together) is 100. That way you’re capturing as many real positives as possible without being overwhelmed.
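One way to sketch that threshold choice, assuming alerts fire whenever a score is at or above the threshold: walk down the distinct scores and stop just before the number of alerts would exceed your capacity. The name `threshold_for_capacity` and the sample scores are hypothetical.

```python
def threshold_for_capacity(scores, capacity):
    """Lowest threshold that flags at most `capacity` cases.

    A case is flagged when score >= threshold. Returns infinity when even
    the single highest score would exceed capacity (for example, capacity 0).
    """
    best = float("inf")
    for threshold in sorted(set(scores), reverse=True):
        flagged = sum(1 for s in scores if s >= threshold)
        if flagged > capacity:
            break  # going any lower would overwhelm the responders
        best = threshold
    return best


# Hypothetical scores; imagine capacity is 3 interventions rather than 100.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
chosen = threshold_for_capacity(scores, capacity=3)
```

Tied scores are the one wrinkle: if several students share a score, they all cross the threshold together, so this sketch stays under capacity rather than going over.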
Another Option: Change the Constraint
Rather than tuning your threshold, you could tune your capacity. Let’s say you want to catch 95% of real problems. Picture point A sliding up the blue curve until it hits 0.95 on the true positive rate, then do some quick math on the 80-something percent false positive rate that comes with it. Without recomputing the whole confusion matrix, this graph can show you that you’d have about 1/3 more work than at the current threshold.
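The “quick math” is roughly this: the number of alerts at any point on the curve is the true positive rate times the number of real problems, plus the false positive rate times the number of non-problems. Here is a small sketch. Every number in it (the population sizes and both operating points) is invented for illustration and is not read off our graph, so the ratio it produces will not match the one above.

```python
def expected_alerts(tpr, fpr, n_positive, n_negative):
    """Alerts raised at one operating point: true positives plus false positives."""
    return tpr * n_positive + fpr * n_negative


# Hypothetical population: 100 students with real problems, 900 without.
n_positive, n_negative = 100, 900

# Hypothetical operating points as (true positive rate, false positive rate).
current = expected_alerts(0.70, 0.50, n_positive, n_negative)
target = expected_alerts(0.95, 0.80, n_positive, n_negative)

# How much more work the higher-recall point creates, as a ratio.
extra_work = target / current
print(f"{current:.0f} alerts now, {target:.0f} at 95% recall, {extra_work:.2f}x the work")
```

The useful part is that you only need two points from the curve and a rough head count to answer the staffing question.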
Awesome Option: Compare Multiple Models
Here’s a graph with another model on it.
The green curve is the model we can use after we collect more information about students. This graph shows you a lot:
- Both models work
- The weekly model works better in all cases and should be adopted as early as possible
- An early response policy is expensive: the same staff can respond to more real problems on the green line
This is the real power of the ROC curve. It lets you make predictions about what will happen in a variety of models and scenarios. That’s how we actually discuss model “accuracy” in the commonly understood sense. We want to know which model will perform better at an interesting point. We can:
- select a threshold based on an ability to respond
- compare multiple models’ performance right at the interesting point
- answer “what-if” questions so we can evaluate changing the constraints instead of the model
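To make “compare at the interesting point” concrete, here is a minimal sketch that holds response capacity fixed and counts how many real problems each model surfaces within it. The two score lists are invented stand-ins for an early model and a later one; neither comes from our real models, and the function name is hypothetical.

```python
def true_positives_at_capacity(scores, labels, capacity):
    """Count real problems among the `capacity` highest-scoring cases."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:capacity])


# Hypothetical outcomes (1 = real problem) and scores from two models.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
early_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
later_scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.4, 0.6, 0.1]

capacity = 4  # staff can respond to four alerts
early_hits = true_positives_at_capacity(early_scores, labels, capacity)
later_hits = true_positives_at_capacity(later_scores, labels, capacity)
print(f"early model: {early_hits} real problems, later model: {later_hits}")
```

Same staff, same number of alerts, different number of real problems reached. That difference at your own operating point is usually a more useful comparison than any single summary score.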
I can hear a lot of my audience thinking “but that will never work here…” I’m interested in hearing the rest of that thought. Have a different problem? Or can’t figure out how this applies to you? Reach out to me or Mike. We have a lot more tools in the bag, and we’re always interested in a new challenge.