Simple Screen Scraping

I’ve integrated a lot of data sources into my NumerousApp metrics, and some of them come from screen scraping web pages. I’ve stumbled into a fairly general way to do this, so I’m writing it up here.

First off, make sure the web site you are scraping from allows it. The example I’ll use here is scraping data off the LCRA “How full are our lakes” site: http://www.lcra.org/water/Pages/default.aspx

This is a public-data site and we are allowed to scrape it. I’ll show how to get that “lake level” percentage (34% at the time I’m writing this) off the site in a way that works for a lot of other sites as well.

Next we need to inspect the HTML surrounding the number we want. I use Chrome, which has a very easy way to do this:

[Screenshot: LCRA screen scrape example]

In this screen grab I have right-clicked on the highlighted “34%” and selected “Inspect Element” from the pop-up menu. Down below we see the crucial thing we are looking for: a “div” tag with a “class” attribute. This is the key. It can be a div tag, a span tag, or any other element that has either a “class” or (even better) an “id” attribute on it.

Because of the tools people use to create web sites like this, and sometimes because they specifically create the site with scraping in mind, the number part of the display (the “34%” in this case) will often sit inside its own HTML element carrying a unique class or id. If you’ve lucked into a site that meets this criterion, you can screen-scrape it very simply without parsing much of the structure of the site at all. Here’s how I do it in Python using the Requests library (for HTTP) and BeautifulSoup (for HTML parsing):

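import requests
import bs4

# Any CSS selector works here; this one matches on the class attribute.
selectThis = "[class=lcraHowFull-Percentage]"
q = requests.get("http://www.lcra.org/water/Pages/default.aspx", timeout=30)
q.raise_for_status()  # stop on an HTTP error instead of parsing an error page
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = bs4.BeautifulSoup(q.text, "html.parser")
items = soup.select(selectThis)

# Take the first string under the first match that parses as a number.
v = None
for s in items[0].stripped_strings:
    # Strip characters that often decorate numbers.
    for c in "%$,":
        s = s.replace(c, "")
    try:
        v = float(s)
        break
    except ValueError:
        pass

print(v)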

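For the “selectThis” value we can use any CSS selector syntax; in this case the LCRA conveniently tagged the item with class “lcraHowFull-Percentage”. The attribute form I used, “[class=lcraHowFull-Percentage]”, matches on the value of the class attribute; the more common shorthand “.lcraHowFull-Percentage” would also work here.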
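To make the selector choices concrete, here is a small sketch run against an inline snippet rather than the live page. The markup is invented for illustration (it is not the real LCRA HTML), and the extra “highlight” class and the “lake-summary” id are assumptions added only to show how the different selector forms behave.

```python
import bs4

# Hypothetical markup, invented for illustration; not the real LCRA page.
html = """
<div id="lake-summary">
  <div class="lcraHowFull-Percentage highlight">34%</div>
  <span class="lcraHowFull-Label">full</span>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Attribute selector: the whole class attribute must equal this string,
# so an element carrying a second class is not matched.
exact = soup.select("[class=lcraHowFull-Percentage]")

# Class selector: matches any element whose class list contains the name.
by_class = soup.select(".lcraHowFull-Percentage")

# Id selector, and a descendant selector combining id and class.
by_id = soup.select("#lake-summary")
nested = soup.select("#lake-summary .lcraHowFull-Percentage")

print(len(exact), len(by_class), len(by_id), len(nested))
print(by_class[0].get_text(strip=True))
```

If a page puts several classes on the element you want, the “.classname” form is probably the safer choice; the attribute form is fine when the class attribute holds exactly one name.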

The code simply takes every string found under the first element matching that selector, strips out some “noise” characters that often adorn numbers, and tries to convert the result to a floating point number.

[ As an aside, there are faster ways to strip those characters out than the loop I wrote, but unfortunately the string translation functions changed between Python 2 and Python 3; I wrote the code this way so it works unchanged under either version. ]
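For what it’s worth, one alternative that also runs unchanged on both Python 2 and Python 3 is a regular expression. This is only a sketch: the names NOISE and strip_noise are made up for the example, and I have not measured whether it is actually faster than the loop.

```python
import re

# Characters that often decorate numbers on web pages.
NOISE = re.compile(r"[%$,]")


def strip_noise(s):
    """Return s with every '%', '$' and ',' removed."""
    return NOISE.sub("", s)


print(strip_noise("$1,234.50"))
print(strip_noise("34%"))
```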

The first thing that successfully converts to a number is (we hope) the data we want. Easy!

One advantage of this is that we’ve completely ignored all the formatting, layout, and structure of the web page. We found a number, wherever it happened to be, that was tagged with the identifier we were looking for. This is both the strength and the weakness of this technique. It’s probably robust against future changes in the web site formatting (assuming they don’t change the identifier). But it’s also blindly assuming that the first number we find under that identifier is the one we are looking for. It’s up to you whether this is “good enough” as far as scraping goes.
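If “first number wins” feels too trusting, one possible tightening is to insist that exactly one number turns up and that it falls in a plausible range. The sketch below works on a plain list of strings, so it stands in for “stripped_strings” without needing a web page; the function names and the 0 to 100 range are assumptions chosen for a percentage like this one.

```python
def all_numbers(strings, noise="%$,"):
    """Return every string that parses as a float once noise is removed."""
    found = []
    for s in strings:
        for c in noise:
            s = s.replace(c, "")
        try:
            found.append(float(s))
        except ValueError:
            pass
    return found


def only_number(strings, low=0.0, high=100.0):
    """Return the single number in strings, or raise ValueError."""
    found = all_numbers(strings)
    if len(found) != 1:
        raise ValueError("expected exactly one number, found %d" % len(found))
    if not low <= found[0] <= high:
        raise ValueError("value %r outside [%r, %r]" % (found[0], low, high))
    return found[0]


print(only_number(["Lakes are", "34%", "full"]))
```

With this version a page change that puts a second number under the same identifier raises an error instead of quietly returning the wrong value.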

Of course my real code contains more error checking and an argparse section (so I can supply command arguments for the URL and the selector) and so forth. But the above code works as-is as a simple example.
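As a rough idea of what such an argparse section might look like (this is a hypothetical sketch, not my actual script; the argument names and the --timeout option are assumptions):

```python
import argparse


def build_parser():
    """Build a command-line parser taking a URL and a CSS selector."""
    parser = argparse.ArgumentParser(
        description="Print the first number found under a CSS selector.")
    parser.add_argument("url", help="page to fetch")
    parser.add_argument("selector", help="CSS selector for the element")
    parser.add_argument("--timeout", type=float, default=10.0,
                        help="seconds to wait for the server")
    return parser


# Parse an explicit list here so the sketch runs without a command line;
# a real script would call build_parser().parse_args() with no arguments.
args = build_parser().parse_args([
    "http://www.lcra.org/water/Pages/default.aspx",
    "[class=lcraHowFull-Percentage]",
])
print(args.url, args.selector, args.timeout)
```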

Sometimes you have to do more work with the “soup” and explicitly parse/traverse the tree to find the data you want. But so far I’ve found that this simple outline works in a surprisingly wide variety of cases. All you have to do is find the right CSS selector to pluck the right class or id descriptor and off you go.
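For the harder cases, here is a minimal sketch of what explicit traversal can look like when the value has no class or id of its own. The table, the lake names, and the numbers are all invented for illustration; the idea is to find a nearby label by its text and then step to the neighboring element.

```python
import bs4

# Hypothetical markup with no usable class or id on the value cell.
html = """
<table>
  <tr><td>Lake A</td><td>41%</td></tr>
  <tr><td>Lake B</td><td>30%</td></tr>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Locate the cell by its label text, then step to the next cell in the row.
label = soup.find("td", string="Lake B")
value_cell = label.find_next_sibling("td")
value = float(value_cell.get_text(strip=True).replace("%", ""))
print(value)
```

This kind of code depends on the page layout in a way the selector approach does not, so it is more likely to break when the site is redesigned.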
