All posts by Neil

Newest versions of NumerousApp API class libraries

As I wrote earlier, I have created Python and Ruby class libraries for the NumerousApp APIs. I’ve recently made a bunch of updates; the newest versions are:

And, as always, github repos: https://github.com/outofmbufs/

Ignore / Enjoy.

Arduino Data Monitor Program

Finally got around to uploading (to github) and documenting my Arduino program for monitoring analog inputs. It’s pretty cool; it:

  • Keeps a ring-buffer of readings and allows you to see what your monitored input has been doing over time.
  • Implements a web server – you can access the readings with your web browser.
  • Has a JSON interface.
  • Can be used to provide network-accessible analog input readings.

Documentation: https://github.com/outofmbufs/arduino-datamon/wiki and of course source code on github too.

I’m using it with a simple circuit: a pull-up resistor (one side attached to +5V) and a photo-resistor (one side attached to the pull-up, the other to ground). This is a simple voltage divider; the voltage you read at the midpoint depends on how much light is hitting the sensor. I made four of these to monitor several different indicator lights on equipment in my house, as well as the basement area lights. I determined the appropriate resistor values experimentally; the behavior of the photo-resistor depends quite a bit on how much light hits it and thus how big a difference it sees between “on” and “off”.
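To get a feel for the divider math: the midpoint voltage is Vcc × R_photo / (R_pullup + R_photo), and the Arduino's 10-bit ADC maps 0-5V onto readings of 0-1023. Here's a quick sketch in Python; the resistor values are made-up examples, not the ones in my circuits:

# Rough sketch of the voltage-divider math for one sensor circuit.
# The resistor values below are hypothetical examples, not my actual parts.

VCC = 5.0         # supply voltage
ADC_MAX = 1023    # Arduino 10-bit analogRead() range

def adc_reading(r_pullup, r_photo):
    """Expected analogRead() value for a pull-up + photo-resistor divider."""
    v_mid = VCC * r_photo / (r_pullup + r_photo)
    return round(v_mid * ADC_MAX / VCC)

# A photo-resistor might swing from ~1k ohms (bright) to ~100k ohms (dark):
print(adc_reading(10000, 1000))     # lights on  -> low reading (about 93)
print(adc_reading(10000, 100000))   # lights off -> high reading (about 930)

The brighter the light, the lower the photo-resistor's resistance and the lower the reading, which is why a low reading means "on" in what follows.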

I hooked four of these simple circuits up to my Arduino and I use this data monitor code to record the readings. I can also get real-time readings. For example, right now I can tell that my basement area lights are on (I know there are workers doing maintenance today) because when I surf to http://monitor/v2 I get:

pin 2 value 401 @ 454619317

(“monitor” is the hostname of my arduino on my network). Granted, to interpret this I had to know that pin 2 (the “2” in the “v2” part of the URL) is the pin connected to the photo-resistor monitoring the basement lighting, and I had to know that values below about 800 mean the basement lights are on. Of course I’ve also written a small status program that shows me this in English. If I surf to that page (which runs a CGI script that queries the monitor and returns human-readable results) I see:

Lights:

Basement: ON for 3.2 hours
Server Room: OFF for 5.3 days
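My actual status script isn't shown here, but the idea is simple enough to sketch. The sketch below assumes the "pin N value V @ T" line format shown above and the roughly-800 threshold I mentioned; the hostname, pin assignments, and labels are illustrative, and the "ON/OFF for N hours" duration tracking in my real script (which needs the reading history) is left out:

# Hypothetical sketch of a status checker built on the monitor's /vN output.
# Assumes the "pin N value V @ T" line format shown above; the pin numbers,
# labels and the 800 threshold are illustrative, not my real configuration.

import requests

SENSORS = {2: "Basement"}    # pin number -> human-readable name

for pin, name in SENSORS.items():
    line = requests.get("http://monitor/v%d" % pin).text
    # line looks like: "pin 2 value 401 @ 454619317"
    value = int(line.split()[3])
    state = "ON" if value < 800 else "OFF"
    print("%s: %s (reading %d)" % (name, state, value))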

Code for the arduino monitor is on github, as already mentioned: https://github.com/outofmbufs/arduino-datamon/

Simple Screen Scraping

I’ve integrated a lot of data sources in my NumerousApp metrics and some of them come from screen scraping web pages. I’ve stumbled into a fairly general way to do this so I’m writing it up here.

First off, make sure the web site you are scraping from allows it. The example I’ll use here is scraping data off the LCRA “How full are our lakes” site: http://www.lcra.org/water/Pages/default.aspx

This is a public-data site and we are allowed to scrape it. I’ll show how to get that “lake level” percentage (34% at the time I’m writing this) off the site in a way that works for a lot of other sites as well.

The first step is to inspect the HTML surrounding the number we want. I use Chrome, which has a very easy way to do this:

[Screenshot: LCRA screen scrape example]

In this screen grab I have right-clicked on the highlighted “34%” and selected “Inspect Element” from the pop-up menu. In the panel below we see the crucial thing we are looking for: a “div” tag with a class identifier. This is the key. It can be a div tag, a span tag, or any element that has either a “class” or (even better) an “id” attribute associated with it.

Because of the tools people use to create web sites like this, and sometimes because they specifically create the site with scraping in mind, the number part of the display (the “34%” in this case) will often be contained within an HTML element carrying its own unique identifier. If you’ve lucked into a site that meets this criterion you can screen-scrape it very simply, without parsing much of the structure of the site at all. Here’s how I do it in Python using the Requests library (for HTTP) and BeautifulSoup (for HTML parsing):

import requests
import bs4

# CSS selector for the element the LCRA tags with its own class
selectThis = "[class=lcraHowFull-Percentage]"

q = requests.get("http://www.lcra.org/water/Pages/default.aspx")
soup = bs4.BeautifulSoup(q.text, "html.parser")   # explicit parser; avoids a bs4 warning
items = soup.select(selectThis)

# Take every string under the selected element, strip characters that
# often adorn numbers, and use the first one that parses as a number.
v = None
for s in items[0].stripped_strings:
    for c in "%$,":
        s = s.replace(c, "")
    try:
        v = float(s)
        break
    except ValueError:
        pass

print(v)

For the “selectThis” value we can use any CSS selector syntax; in this case the LCRA conveniently tagged the item with the class “lcraHowFull-Percentage”.

The code simply takes every string found under that selector, strips out some “noise” characters that often adorn numbers (percent signs, dollar signs, commas), and tries to convert the result to a floating point number.

[ as an aside, there are faster ways to strip those characters out than the loop I wrote, but unfortunately the string translation functions changed from Python 2 to Python 3; I wrote the code this way so it works unchanged under either version of Python ]
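For instance, if you only care about Python 3, the stripping can be done in a single call with str.translate; this is just an illustration of the faster approach mentioned in the aside, not what the scraper above uses:

# Python 3 only: strip the "noise" characters in one translate() call.
noise = str.maketrans("", "", "%$,")
s = "34%".translate(noise)    # -> "34"
v = float(s)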

The first thing that successfully converts to a number is (we hope) the data we want. Easy!

One advantage of this is that we’ve completely ignored all the formatting, layout, and structure of the web page. We found a number, wherever it happened to be, that was tagged with the identifier we were looking for. This is both the strength and the weakness of the technique. It’s probably robust against future changes in the web site’s formatting (assuming they don’t change the identifier). But it’s also blindly assuming that the first number we find under that identifier is the one we are looking for. It’s up to you whether this is “good enough” as far as scraping goes.

Of course my real code contains more error checking and an argparse section (so I can supply command arguments for the URL and the selector) and so forth. But the above code works as-is as a simple example.

Sometimes you have to do more work with the “soup” and explicitly parse/traverse the tree to find the data you want. But so far I’ve found that this simple outline works in a surprisingly wide variety of cases. All you have to do is find the CSS selector that plucks out the element with the right class or id, and off you go.
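When a single select() isn't enough, that extra work usually means walking the parsed tree with find()/find_all(). Here's a hedged sketch of what that tends to look like; the URL, tag names, and attributes are made up for illustration, and the real ones come from inspecting whatever page you're scraping:

# Sketch of the "do more work with the soup" case.  The URL, tag names and
# attributes here are hypothetical; inspect the actual page to find yours.

import requests
import bs4

page = requests.get("http://example.com/somepage").text
soup = bs4.BeautifulSoup(page, "html.parser")

# e.g. find a table by id, then walk its rows looking for a known label
table = soup.find("table", id="results")
for row in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells and cells[0] == "Lake level":
        print(cells[1])
        break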

Google Ping Update

Update: here’s the traceroute when ping performance is 50+ msec:

1 pfsense.nw 1.578 ms
2 * * *
3 * * *
4 tge7-2.ausbtx5202h.texas.rr.com 22.567 ms
5 tge8-5.ausbtx5201h.texas.rr.com 18.703 ms
6 tge0-12-0-6.ausutxla01r.texas.rr.com 15.788 ms
7 agg22.dllatxl301r.texas.rr.com 21.964 ms
8 107.14.17.136 21.865 ms
9 ae1.pr1.dfw10.tbone.rr.com (107.14.17.234)
10 207.86.210.125 (207.86.210.125) 26.794 ms
11 207.88.14.182.ptr.us.xo.net 21.649 ms
12 207.88.14.189.ptr.us.xo.net 17.312 ms
13 ip65-47-204-58.z204-47-65.customer.algx.net 20.543 ms
14 72.14.233.85 24.646 ms
15 72.14.237.219 24.955 ms
16 209.85.243.178 27.354 ms
17 72.14.239.136 43.774 ms
18 216.239.50.237 48.918 ms
19 209.85.243.55 44.747 ms
20 ord08s11-in-f20.1e100.net 43.592 ms

Four extra hops. Hitting a google server presumably somewhere in Chicago instead of Dallas. ORD is the Chicago O’Hare airport code; I don’t really know whether the google data centers are in fact at/near airports or whether they are just using airport codes as a convenient naming scheme for a general area.

So, sometimes I get directed to Dallas, sometimes to Chicago. Will report if I ever see any other server locations.

As an aside, “1e100.net” is Google’s clever name for their network (1e100 being scientific notation for a googol).

Hilltop Google Ping Performance

For the past month I have been measuring internet ping (ICMP ECHO) performance from my hilltop network. I do this with a script that runs every 30 minutes and measures the response time of www.google.com as reported by ping.

The script first pings www.google.com exactly once in an attempt to cache the DNS lookup (so DNS time doesn’t affect the results). It then invokes the Unix ping program with “-c 10” to do 10 pings. I throw out the highest and lowest result times and average the remaining 8. I record the results in a NumerousApp metric (of course).

The results are shown here:

[Chart: Hilltop Google Ping]

Raw data available in the numerous metric: Hilltop Google Ping

Ignore the occasional spikes, where some network disruption was obviously causing consistently high ping times for that measurement (and the one zero data point, where a bug in the script recorded a zero reading while the network was completely down).

What’s left is two distinct, consistent readings: something in the low-to-mid 20 msec response range and something averaging around 50 msec. It seems pretty apparent that there are two different routes between my hilltop network and www.google.com, and for whatever reason sometimes I’m hooked up to the faster/shorter route and sometimes the longer one.

Here, for your amusement, is the heart of my “pinggoo” script:

ping -c 10 $TARGET |
grep from | grep 'time=' |
sed -e 's,.*time=,,' | 
awk ' { print $1 } ' | sort -n | sed -e '1d' -e '$d' | 
awk 'BEGIN {SUM=0; N=0} {SUM=SUM+$1; N=N+1} END {print SUM/( (N>0) ? N : 1)}'

Here’s what a typical output line from ping looks like on my mac:

64 bytes from 74.125.227.114: icmp_seq=0 ttl=53 time=22.388 ms

This is some truly fine shell hackery. It turns out the two grep statements are redundant (either one alone suffices) but I put them both in as a way to ensure I was really looking only at the successful ping lines (the ping program itself puts out a lot of other verbose output). Then the sed deletes everything up through “time=”. The first awk program (“print $1”) separates the time from the trailing “ms”. What we then have is (hopefully) a list of 10 numbers, one per line. I use sort to put them in numeric order, then sed to delete the first and last lines (the highest and lowest ping times), and then the final awk program to calculate the average of what remains.

I’m sure I could have done all this with a single pass of awk, or a Python program, or something along those lines. However, one nice thing about this hackery is that it is fairly robust across ping variants; so far it has worked just fine on my Mac and on Debian wheezy, even though the two ping programs have different output formats (the essential “time=” part is similar enough on both to work unchanged with this script).
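For the curious, here is roughly what a single-pass Python version might look like. This is just a sketch of the same trimmed-mean idea, not the script I actually run (which also does the DNS-warming ping and writes the result to a NumerousApp metric):

# Rough Python equivalent of the shell pipeline above: run ping, pull out the
# "time=" values, drop the highest and lowest, and average the rest.
# A sketch only, not the pinggoo script itself.

import re
import subprocess

def avg_ping(target, count=10):
    out = subprocess.check_output(["ping", "-c", str(count), target],
                                  universal_newlines=True)
    times = sorted(float(m.group(1)) for m in re.finditer(r"time=([\d.]+)", out))
    trimmed = times[1:-1] if len(times) > 2 else times
    if not trimmed:
        return None      # no successful pings at all
    return sum(trimmed) / len(trimmed)

print(avg_ping("www.google.com"))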

Here’s a traceroute I just did while I appear to be getting the faster performance:

 
 1  pfsense.nw 1.379 ms
 2  * * *
 3  * * *
 4  tge7-5.trswtx1202h.texas.rr.com 20.527 ms
 5  tge0-12-0-14.ausxtxir02r.texas.rr.com 18.915 ms
 6  agg22.hstqtxl301r.texas.rr.com 28.277 ms
 7  107.14.19.94 24.099 ms
 8  ae-0-0.cr0.dfw10.tbone.rr.com 27.132 ms
 9  ae0.pr1.dfw10.tbone.rr.com 24.956 ms
10  207.86.210.125 21.302 ms
11  207.88.14.182.ptr.us.xo.net 25.221 ms
12  207.88.14.189.ptr.us.xo.net 24.043 ms
13  ip65-47-204-58.z204-47-65.customer.algx.net 27.043 ms
14  72.14.233.77 26.225 ms
15  64.233.174.137 25.225 ms
16  dfw06s32-in-f18.1e100.net 23.943 ms

I edited out a bunch of the output detail to make it fit on this page better. Sixteen hops to whichever google server is serving me. If we interpret the hostname at face value my google server (at this moment) is in DFW somewhere.

I’m on Time Warner cable and was recently upgraded to 100Mb performance. This (unsurprisingly) doesn’t seem to have had any material impact on the ping times (throughput and latency being somewhat independent).

I’ll report back if I get any other interesting traceroute data especially when I’m in the 50msec performance arena.