Developing a Bot Army

This may be a little different territory for this blog since it’s not directly related to data analytics. But it still inhabits the same universe, and I thought it was interesting enough to share.

Over the last few months I had an opportunity to work on a project that took a unique approach to data acquisition. The data that we were interested in came from various sources. However, it was not always available from feeds or dumps or even from APIs. A good portion of the data was only accessible by visiting a website, submitting a search form and then drilling down through the results. Since we were going after a lot of data, this would have to be automated in some fashion. Basically what we needed to set up was a scalable screen scraping operation.

Luckily, there is a fair amount of technology available to accomplish just this. The core technology falls under the banner of browser automation and is something that is used frequently by QA departments. At least by QA departments that strive to ensure their websites run successfully in a lot of different browsers.

Essentially, what browser automation requires is a browser, a script and a driver that can execute the script as a series of instructions sent to the browser just as if a user were interacting with it.

One of the most widely used frameworks for this is Selenium (http://docs.seleniumhq.org/). Selenium has a number of products for automated browser testing, with bindings for a lot of different languages and platforms, e.g. Java, .NET and Ruby.


When you run a script using this framework, what you will see is a browser firing up on your desktop and then, kind of like a driverless car, web pages load, links get clicked and forms get filled out without any hands on the keyboard. It’s quite nerdly satisfying.

This is essentially what we needed, but we still needed to figure out how to do it at a large scale. Again, luckily, Selenium has an API which can be called from one machine to control a browser running on another machine. This means we could run a server that acted as a host for a bunch of browsers. To scale up, we could simply add more browsers, separate from whatever would be running our scripts.
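To give a flavor of what such a script looks like, here is a minimal sketch using the selenium-webdriver package for Node.js. The Selenium server address, target URL and form selectors are placeholders for illustration, not the ones from our actual project:

    // Drive a browser hosted on a remote Selenium server from a Node.js script.
    const { Builder, By, until } = require('selenium-webdriver');

    (async () => {
      const driver = await new Builder()
        .usingServer('http://selenium-host:4444/wd/hub') // assumed server address
        .forBrowser('chrome')
        .build();
      try {
        await driver.get('https://example.com/search');             // hypothetical site
        await driver.findElement(By.name('q')).sendKeys('widgets');  // fill the search form
        await driver.findElement(By.css('button[type=submit]')).click();
        await driver.wait(until.titleContains('Results'), 10000);    // wait for the results page
      } finally {
        await driver.quit();
      }
    })();

Scaling up is then mostly a matter of pointing more of these scripts at a server pool with more browsers behind it.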

To run our scripts we looked to Node.js (http://nodejs.org/). Node.js, being JavaScript that runs on a server, is a sibling to the JavaScript that runs in a browser, which is one of the reasons we chose Node.js over Java. It’s also not such a leap to think that jQuery should be able to run comfortably in that context as well. And that’s exactly what we were able to do with a Node.js jQuery variant called Cheerio.


Using the Selenium protocol we could easily automate loading and navigating a website. Then, using nothing but the JavaScript idiom with Node.js and Cheerio, we could easily navigate the DOM structure of a web page to extract the data contained therein. Once we had the data, we could post it to our own data collection API.
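Sketched out, that extraction step looks something like the following. The result-table selectors and the collection API endpoint are invented for the example, and it assumes a runtime with an HTTP client available (recent Node versions ship a built-in fetch):

    const cheerio = require('cheerio');

    // `driver` is the selenium-webdriver session from the previous sketch.
    async function scrapeResults(driver) {
      // Pull the rendered HTML out of the remote browser.
      const html = await driver.getPageSource();

      // Load it into Cheerio and use the familiar jQuery-style selectors.
      const $ = cheerio.load(html);
      const records = [];
      $('table.results tr').each((i, row) => {          // assumed markup
        records.push({
          name: $(row).find('td.name').text().trim(),
          value: $(row).find('td.value').text().trim(),
        });
      });

      // Post the now-structured records to our own collection API (hypothetical endpoint).
      await fetch('https://data-api.example.com/records', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(records),
      });
    }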

So this gave us the ability to fetch fairly unstructured data and turn it into structured data at a very large scale. From there we were able to feed it into our data pipeline as another data source. And then, step three, profit.

It’s Accurate, But Is It Useful?

In my last post, I dug a little bit into the difference between accuracy and predictive value, and how those get confused when using “accuracy” in conversation. When people ask “how accurate is it?” they aren’t usually asking in the data scientist sense. They aren’t even asking the question I answered last time, “When the light turns red, what are the odds it’s a real problem?” They’re asking: “How useful is this thing in directing my attention where it needs to be?”

Today I’ll give a tour of a good tool for answering that and explain how we use it to tune results to clients’ specific needs. In short: we’ll bridge the gap between technical accuracy and usefulness.

The ROC Curve

Let’s revisit that student risk model. The usefulness of the model depends on how you can respond to it. We need a way to discuss model performance that allows trade-offs between positive and negative factors. For that, we look at a nifty tool for describing binary classifiers: the ROC curve.

Here’s the ROC curve for one of our student risk models. It’s the zero-week model we use when we have very little data on a student’s history.

[Figure: ROC curve for the zero-week student risk model, with points A, B and C labeled]

I’ve labeled some important features:

  • (A) Our model’s performance from the last post
  • (B) The point of perfect prediction. It’s not a real place. It’s a utopia where only photographers of fast-food menu items can go. In the real world you only end up here with a trivially simple model or egregious over-fitting.
  • (C) The zone of “Oops, I hooked it up backwards.” A point here shows negative predictive value: if you’re consistently here, you have a working model, but its sense of positive and negative is reversed.


The blue curve shows that if you’re willing to tolerate a very high false positive rate, you can always classify all of the true positives correctly. You’ll just capture a lot of garbage too. If you can’t tolerate very many false positives, you’ll pass over a lot of real positives. You can pick any point along this curve.
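To make the trade-off concrete, here is a small sketch (sticking with JavaScript) of how the points on such a curve are computed from a model’s scores and the known outcomes; the field names are assumptions for the example:

    // Sweep the alert threshold from strict to lenient and record the
    // (false positive rate, true positive rate) point at each step.
    // `examples` is an array of { score, label }, label 1 = real problem, 0 = fine.
    function rocPoints(examples) {
      const sorted = [...examples].sort((a, b) => b.score - a.score);
      const totalPos = sorted.filter(e => e.label === 1).length;
      const totalNeg = sorted.length - totalPos;

      const points = [{ threshold: Infinity, tpr: 0, fpr: 0 }];
      let tp = 0;
      let fp = 0;
      for (const e of sorted) {
        if (e.label === 1) tp += 1; else fp += 1;
        points.push({ threshold: e.score, tpr: tp / totalPos, fpr: fp / totalNeg });
      }
      return points;
    }

Every point is one possible alert threshold; the curve in the figure is just all of those points plotted together.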

Now that we know how to read it, how do we use it?

One Option: Tune for Response Capacity

Only have time to deal with 100 interventions? Set your alert threshold so the total number of predicted positives (the true positives plus the false positives) is 100. That way you’re capturing as many real positives as possible without being overwhelmed.
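In code terms, that’s just sorting the risk scores and taking the capacity-th highest one as the threshold. A sketch, assuming higher scores mean higher risk:

    // Flag at most `capacity` students: the threshold is the capacity-th
    // highest score, so everyone scoring at or above it gets an alert.
    function thresholdForCapacity(scores, capacity) {
      const sorted = [...scores].sort((a, b) => b - a);
      return sorted[Math.min(capacity, sorted.length) - 1];
    }

    // e.g. const alertThreshold = thresholdForCapacity(riskScores, 100);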

Another Option: Change the Constraint

Rather than tuning your threshold, you could tune your capacity. Let’s say you want to catch 95% of real problems. Picture point A sliding up the blue curve until it hits .95 on the true positive rate, and do some quick math on the 80-something percent false positive rate that comes with it. Without recomputing the whole confusion matrix, this graph can show you that you’d have about 1/3 more work than at the current threshold.
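Reading that off the curve is the same as scanning the computed points for the first one that reaches the target true positive rate, roughly:

    // Using the points from rocPoints(): find the first threshold that reaches
    // the desired true positive rate and see what false positive rate
    // (i.e. extra workload) comes with it.
    function whatIf(points, targetTpr) {
      return points.find(p => p.tpr >= targetTpr);
    }

    // e.g. whatIf(points, 0.95) -> { threshold, tpr, fpr }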

Awesome Option: Compare Multiple Models

Here’s a graph with another model on it.

[Figure: ROC curves for the zero-week model (blue) and the weekly model (green)]

The green curve is the model we can use after we collect more information about students. This graph shows you a lot:

  • Both models work
  • The weekly model works better in all cases and should be adopted as early as possible
  • An early response policy is expensive: the same staff can respond to more real problems on the green line


This is the real power of the ROC curve. It lets you make predictions about what will happen in a variety of models and scenarios. That’s how we actually discuss model “accuracy” in the commonly understood sense. We want to know which model will perform better at an interesting point. We can:

  • select a threshold based on an ability to respond
  • compare multiple models’ performance right at the interesting point (see the sketch after this list)
  • answer “what-if” questions so we can evaluate changing the constraints instead of the model
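For that middle item, comparing models at the interesting point is just reading each curve at the same operating point. A sketch, again building on the rocPoints() helper above:

    // The best true positive rate a model reaches without exceeding a
    // given false positive rate (i.e. a fixed tolerance for wasted effort).
    function tprAtFpr(points, targetFpr) {
      let best = 0;
      for (const p of points) {
        if (p.fpr <= targetFpr) best = Math.max(best, p.tpr);
      }
      return best;
    }

    // e.g. compare tprAtFpr(zeroWeekPoints, 0.2) with tprAtFpr(weeklyPoints, 0.2)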


I can hear a lot of my audience thinking “but that will never work here…” I’m interested in hearing the rest of that thought. Have a different problem? Or can’t figure out how this applies to you? Reach out to me or Mike. We have a lot more tools in the bag, and we’re always interested in a new challenge.