Simple Screen Scraping

I’ve integrated a lot of data sources in my NumerousApp metrics and some of them come from screen scraping web pages. I’ve stumbled into a fairly general way to do this so I’m writing it up here.

First off, make sure the web site you are scraping from allows it. The example I’ll use here is scraping data off the LCRA “How full are our lakes” site:

This is a public-data site and we are allowed to scrape it. I’ll show how to get that “lake level” percentage (34% at the time I’m writing this) off the site in a way that works for a lot of other sites as well.

First off we need to inspect the HTML surrounding the number we want. I use Chrome which has a very easy way to do this:

LCRA screen scrape example

In this screen grab I have right-clicked on the highlighted “34%” and selected “Inspect Element” from the pop-up menu. Down below we see the crucial thing we are looking for: a “div” tag with a class identifier. This is the key. It can be a div tag, a span tag, or anything that has either a “class” or (even better) an “id” attribute associated with it.

Because of the tools people use to create web sites like this, and sometimes because they specifically create the site with scraping in mind, the number part of the display (the “34%” in this case) will often be contained within an HTML element with its own tag with a unique identifier. If you’ve lucked into a site that meets this criteria you can screen-scrape it very simply without parsing much of the structure of the site at all. Here’s how I do it with python using the Requests library (for http) and BeautifulSoup (for HTML parsing):

import requests
import bs4

selectThis = "[class=lcraHowFull-Percentage]"
q = requests.get("")       # the LCRA page URL goes here
soup = bs4.BeautifulSoup(q.text)
items = soup.select(selectThis)

v = None
for s in items[0].stripped_strings:
    for c in "%$,":
        s = s.replace(c, "")
    try:
        v = float(s)
        break
    except ValueError:
        pass


For the “selectThis” value we can use any CSS selector syntax; in this case the LCRA conveniently tagged the item with class “lcraHowFull-Percentage”.

The code simply takes every string found under that search criterion, strips out some “noise” characters that often adorn numbers, and tries to convert the result to a floating-point number.

[ as an aside, there are faster ways to strip those characters out than the loop I wrote, but unfortunately the string translation functions changed from python2 to python3; I wrote the code this way so it works unchanged under either version of python ]
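For the curious, the Python 3 version of that faster approach looks something like the sketch below (Python 2 spelled it s.translate(None, "%$,") instead, which is exactly the incompatibility the portable loop sidesteps):

```python
# Python 3: delete all the "noise" characters in one translate() call.
# str.maketrans("", "", chars) builds a table that deletes chars.
noise = str.maketrans("", "", "%$,")

s = "$1,234.5%"
cleaned = s.translate(noise)   # "1234.5"
v = float(cleaned)
```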

The first thing that successfully converts to a number is (we hope) the data we want. Easy!

One advantage of this is that we’ve completely ignored all the formatting, layout, and structure of the web page. We found a number, wherever it happened to be, that was tagged with the identifier we were looking for. This is both the strength and weakness of this technique. It’s probably robust against future changes in the web site formatting (assuming they don’t change the identifier). But it’s also just blindly accepting that “whatever the first number we find under that identifier is the one we are looking for”. It’s up to you whether this is “good enough” as far as scraping goes.

Of course my real code contains more error checking and an argparse section (so I can supply command arguments for the URL and the selector) and so forth. But the above code works as-is as a simple example.

Sometimes you have to do more work with the “soup” and explicitly parse/traverse the tree to find the data you want. But so far I’ve found that this simple outline works in a surprisingly wide variety of cases. All you have to do is find the right CSS selector to pluck the right class or id descriptor and off you go.
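When the selector alone isn’t enough, the extra work usually amounts to finding a nearby labeled element and walking the tree from there. Here’s a minimal sketch of that idea; the HTML fragment, tag names, and class names are invented for illustration:

```python
import bs4

# a made-up page fragment where the value itself has no id/class,
# but sits next to a cell that does
html = """
<table>
  <tr><td class="label">Lake level</td><td>34%</td></tr>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# find the labeled cell, then traverse sideways to the value cell
label = soup.find("td", class_="label")
value = label.find_next_sibling("td").get_text(strip=True)   # "34%"
```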

Google Ping Update

Update: here’s the traceroute when ping performance is 50+ msec:

1 pfsense.nw 1.578 ms
2 * * *
3 * * *
4 22.567 ms
5 18.703 ms
6 15.788 ms
7 21.964 ms
8 21.865 ms
9
10 26.794 ms
11 21.649 ms
12 17.312 ms
13 20.543 ms
14 24.646 ms
15 24.955 ms
16 27.354 ms
17 43.774 ms
18 48.918 ms
19 44.747 ms
20 43.592 ms

Four extra hops. Hitting a google server presumably somewhere in Chicago instead of Dallas. ORD is the Chicago O’Hare airport code; I don’t really know whether the google data centers are in fact at/near airports or whether they are just using airport codes as a convenient naming scheme for a general area.

So, sometimes I get directed to Dallas, sometimes to Chicago. Will report if I ever see any other server locations.

As an aside, “” is google’s clever name for their network.

Hilltop Google Ping Performance

For the past month I have been measuring internet ping (ICMP ECHO) performance from my hilltop network. I do this with a script that runs every 30 minutes and measures the response time to Google as reported by ping.

The script first pings exactly once in an attempt to cache the DNS lookup (to avoid DNS time affecting the results). It then invokes the unix ping program with “-c 10” to do 10 pings. I throw out the highest and lowest result times and average the remaining 8. I record the results in a NumerousApp metric (of course).

The results are shown here:

Hilltop Google Ping

Raw data available in the numerous metric: Hilltop Google Ping

Ignore the occasional spikes when obviously some network disruption was causing consistently high ping times for that measurement (and there is one zero data point where a bug in the script caused a zero reading when the network was completely down).

What’s left is two different consistent readings – something in the low-mid 20msec response range and something averaging in the 50msec response range. It seems pretty apparent that there are two different routes between my hilltop network and Google, and for whatever reason sometimes I’m hooked up to the faster/shorter route and sometimes the longer one.

Here, for your amusement, is the heart of my “pinggoo” script:

ping -c 10 $TARGET |
grep from | grep 'time=' |
sed -e 's,.*time=,,' | 
awk ' { print $1 } ' | sort -n | sed -e '1d' -e '$d' | 
awk 'BEGIN {SUM=0; N=0} {SUM=SUM+$1; N=N+1} END {print SUM/( (N>0) ? N : 1)}'

Here’s what a typical output line from ping looks like on my mac:

64 bytes from icmp_seq=0 ttl=53 time=22.388 ms

This is some truly fine shell hackery. It turns out the two grep statements are redundant (either one alone suffices) but I put them both in as a way to ensure I was really looking only at the successful ping lines (the ping program itself puts out a lot of other verbose output). Then the sed deletes everything prior to the ping time. The first awk program (“print $1”) separates out the time from the trailing “ms”. What we then have is (hopefully) a list of 10 numbers, one per line. I use sort to put them in numeric order, then sed to delete the first and last line (highest/lowest ping time), and then the final awk program to calculate the average.

I’m sure I could have done all this with a single pass of awk or a python program or something along those lines; however, one nice thing about this hackery is that it is fairly robust across ping variants; so far this has worked just fine on my mac and on Debian wheezy, even though the two ping programs have different output formats (but the essential “time=” part is similar enough on both to work unchanged with this script).
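For comparison, the trim-and-average step as a single-pass Python sketch might look something like this (the sample output lines below are made up; a real script would capture them from ping via subprocess):

```python
import re

# sample ping output (addresses omitted, matching the example line above)
ping_output = """\
64 bytes from icmp_seq=0 ttl=53 time=22.388 ms
64 bytes from icmp_seq=1 ttl=53 time=24.100 ms
64 bytes from icmp_seq=2 ttl=53 time=19.950 ms
64 bytes from icmp_seq=3 ttl=53 time=55.000 ms
"""

# pull out every time= value (the grep/sed stages of the pipeline)
times = [float(t) for t in re.findall(r"time=([\d.]+)", ping_output)]

# drop the highest and lowest, average the rest (the sort/sed/awk stages)
trimmed = sorted(times)[1:-1]
avg = sum(trimmed) / len(trimmed)
```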

Here’s a traceroute I just did while I appear to be getting the faster performance:

 1  pfsense.nw 1.379 ms
 2  * * *
 3  * * *
 4 20.527 ms
 5 18.915 ms
 6 28.277 ms
 7 24.099 ms
 8 27.132 ms
 9 24.956 ms
10 21.302 ms
11 25.221 ms
12 24.043 ms
13 27.043 ms
14 26.225 ms
15 25.225 ms
16 23.943 ms

I edited out a bunch of the output detail to make it fit on this page better. Sixteen hops to whichever google server is serving me. If we interpret the hostname at face value my google server (at this moment) is in DFW somewhere.

I’m on Time Warner cable and was recently upgraded to 100Mb performance. This (unsurprisingly) doesn’t seem to have had any material impact on the ping times (throughput and latency being somewhat independent).

I’ll report back if I get any other interesting traceroute data especially when I’m in the 50msec performance arena.

Murphy’s Law

This will reaffirm your faith in Murphy’s Law.

Time Warner recently replaced my cable modem – upgraded for higher performance.

On Friday I went to check the physical install. The modem is down the hill – a quarter mile away – and connects to the house network via a long fiber run.

A long time ago I installed a “remote power rebooter” device for the cable modem so that on those all-too-often occasions when the modem needs to be physically reset it could be power cycled remotely/automatically. Of course the cable guy didn’t plug the new modem into this device, instead he plugged it directly into the wall.

As an aside: the remote control power gizmo I’m using is from Synaccess and it is awesome:

You set this box up to ping a remote address and, if it loses connectivity, it will power-cycle the outlet. So any time my cable modem wedges, it gets rebooted automatically when this device detects loss of internet connectivity.

On Friday I should have unplugged the cable modem and moved it back to the rebooter power outlet. But I was literally on my way out of the house to go out of town for the weekend. The Number One Rule of IT was looming large in my mind: “If it’s working, don’t mess with it”.

The cable modem was working. Power cycling it right before leaving seemed foolish. I could fix it at my leisure on Monday when I got back.

Ah, Murphy, I am so sorry I tempted you that way.

Of course within hours of me actually *being* out of town, the cable modem wedged for some inexplicable reason. I had no VPN access to my home network the entire time I was gone. It wasn’t a big deal, but it was annoying, especially since I knew exactly WHY the modem hadn’t rebooted itself automatically after getting wedged, and yet I was a few thousand miles away, unable to fix it.

Maybe Thoreau was right.


Here’s what I’ve been working on in the “office” … integrating different data sources into NumerousApp metrics.

You can monitor:

Check out the Numerous iPhone application (Android coming soon; these guys have just gotten started). Disclaimer: they are friends of mine and I am an investor/advisor.