This may be a little different territory for this blog since it’s not directly related to data analytics. But, it still inhabits the same universe and I thought it was interesting enough to share.
Over the last few months I had an opportunity to work on a project that took a unique approach to data acquisition. The data that we were interested in came from various sources. However, it was not always available from feeds or dumps or even from APIs. A good portion of the data was only accessible by visiting a website, submitting a search form and then drilling down through the results. Since we were going after a lot of data, this would have to be automated in some fashion. Basically what we needed to set up was a scalable screen scraping operation.
Luckily, there is a fair amount of technology available to accomplish just this. The core technology falls under the banner of browser automation and is something that is used frequently by QA departments. At least by QA departments that strive to ensure their websites can run successfully in a lot of different browsers.
Essentially, what browser automation requires is a browser, a script and a driver that can execute the script as a series of instructions sent to the browser just as if a user were interacting with it.
One of the most widely used frameworks for this is Selenium (http://docs.seleniumhq.org/). Selenium has a number of products for automated browser testing, with bindings for a lot of different platforms, e.g. Java, .NET and Ruby.
When you run a script using this framework, what you will see is a browser firing up on your desktop and then, kind of like a driverless car, you see web pages load, links clicked and forms filled out without any hands on the keyboard. It’s quite nerdly satisfying.
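To make that a bit more concrete, here is a minimal sketch of such a script using Selenium's Python bindings. It is not the code from our project: the URL, the form field name `q` and the `a.result` selector are all made-up placeholders, and a real site would need its own locators.

```python
def open_browser():
    """Start a local Firefox window (needs Selenium and Firefox installed)."""
    # Imported lazily so the helper below can be read and tested on its own.
    from selenium import webdriver
    return webdriver.Firefox()


def search_and_collect(driver, url, query):
    """Load a page, submit a search form and return the result links."""
    # Let element lookups wait up to 10 seconds for the page to catch up.
    driver.implicitly_wait(10)
    driver.get(url)

    # "name" and "css selector" are the string values behind Selenium's
    # By.NAME and By.CSS_SELECTOR constants. The field name "q" and the
    # "a.result" selector are hypothetical.
    box = driver.find_element("name", "q")
    box.send_keys(query)
    box.submit()

    links = driver.find_elements("css selector", "a.result")
    return [link.get_attribute("href") for link in links]
```

Calling `search_and_collect(open_browser(), "http://example.com/search", "widgets")` would be the part where the browser pops up and starts driving itself.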
This is essentially what we needed, but we still had to figure out how to do this at a large scale. Again, luckily, Selenium has developed an API which can be called from one machine to control a browser running on another machine. This means we could run a server that acted as a host for a bunch of browsers. In order to scale up, we could simply add more browsers, separate from whatever would be running our scripts.
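A rough sketch of that arrangement, again in Python and again with assumptions: the hub address is a placeholder (the exact URL path depends on the Selenium server version), and `scrape` stands in for whatever per-query function you have, such as a form-filling script. Each query gets its own remote browser, which is closed when the query is done.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical address of the machine hosting the browsers.
HUB_URL = "http://localhost:4444/wd/hub"


def remote_driver(hub_url=HUB_URL):
    """Ask the remote Selenium server for a fresh Firefox session."""
    from selenium import webdriver  # lazy import, only needed for real runs
    return webdriver.Remote(
        command_executor=hub_url,
        options=webdriver.FirefoxOptions(),
    )


def scrape_one(query, make_driver, scrape):
    """Run one query in its own browser and always release the browser."""
    driver = make_driver()
    try:
        return scrape(driver, query)
    finally:
        driver.quit()


def scrape_many(queries, make_driver, scrape, max_workers=4):
    """Fan queries out over several browsers; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda q: scrape_one(q, make_driver, scrape), queries))
```

With this shape, `scrape_many(queries, remote_driver, my_scrape)` is the only thing the script side needs to call, and adding capacity means adding browsers behind the hub and raising `max_workers`.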
So this gave us the ability to fetch fairly unstructured data and turn it into structured data at a very large scale. From there we were able to feed it into our data pipeline as another data source. And then, step three, profit.
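As an illustration of the "unstructured to structured" step, here is a small sketch that turns an HTML results table into a list of dictionaries using only the Python standard library. It assumes the results arrive as a simple table whose first row holds the column headers, which will not be true for every site. In a Selenium script the HTML would typically come from `driver.page_source`.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]
```

Records in this form are easy to serialize and hand to the rest of a pipeline.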