Random Python Performance Posting

I’ve been writing some very elementary probability simulations in python and wanted to simulate millions of coin flips. I started out with this somewhat-obvious code:

    ht = [0, 0]
    for i in range(n):
        ht[random.choice((0, 1))] += 1

which counts heads/tails in ht[0] and ht[1] and performs n flips.

Can this code be sped up? Inserting time.time() calls before/after the “for” loop I found that this code can perform 10 million flips in 11.8 seconds. So it’s about 1.18 microseconds per flip.

[ aside: all measurements were made on my MacBook Air. I tried to avoid doing anything else while running the timing tests, and in all cases took the fastest result I could get. All results for a given method were similar and were run multiple times at multiple different points in time, so I believe there were no confounding factors with background activities. Your mileage may vary. ]

Borrowing a trick from the old geezer days, I thought “surely if we unroll the loop this will go faster”.  So I changed it to “for i in range(n//100)” and then put 100 of the “ht[random.choice(…” statements in the body of the loop.

Result? 1.16 microseconds per flip. Ok, that’s faster, but by less than 1%. Kudos to python for having efficient looping, plus, of course, the random() call dominates the cost anyway.

Ok maybe random.choice is slower than explicitly asking for a random number:

    ht = [0, 0]
    for i in range(n):
	ht[random.randrange(2)] += 1

Nope, that came in significantly slower! 1.33 microseconds per flip.

Maybe I could speed it up by eliminating array access and just adding up +1 and -1 values in a loop and reconstructing the heads vs. tails numbers after:

    delta = 0
    for i in range(n):
	delta += random.choice((-1, 1))  

That ran as fast as 1.14 microseconds per flip, but still just an insignificant improvement over the the original and arguably more “obvious” code.

Of course, by this point readers are screaming at me “duh, all the time is in the random function and there’s nothing you can do about it” and they are right. I should have tested that first. For example, this code:

    ht = [0, 0]
    for i in range(n):
        ht[1] += 1

runs at 0.11 microseconds per “flip” (albeit there is no flip, it just always adds to ht[1] in this example). Or if we go all the way to the delta method, it will be as little as 0.06 microseconds per “flip” for this:

    delta = 0
    for i in range(n):
        delta += 1

obviously there is no “flip” here but it’s fast.

Of course, we can use a single call to random to generate more than one random bit at a time. In C we might generate a 32-bit random number and then peel off the bits one by one. I’m guessing that doing this in python won’t be as efficient, but there’s still a way to easily amortize one call to the random method over multiple flips. For example we can enumerate all sixteen possibilities for four flips. In binary, they would be: 0000, 0001, 0010, 0011, … 1100 1101 1110 1111. We can note that the first (“0000”) is 4 heads (arbitrarily calling the 0 heads, it doesn’t matter which) and 0 tails. The next one (“0001”) is 3 heads and 1 tails, and so on. The complete enumeration of the outcomes looks like this:

    fourflips = ((4, 0),    # 0000                                              
                 (3, 1),    # 0001                                              
                 (3, 1),    # 0010                                              
                 (2, 2),    # 0011                                              
                 (3, 1),    # 0100                                              
                 (2, 2),    # 0101                                              
                 (2, 2),    # 0110                                              
                 (1, 3),    # 0111                                              
                 (3, 1),    # 1000                                              
                 (2, 2),    # 1001                                              
                 (2, 2),    # 1010                                              
                 (1, 3),    # 1011                                              
                 (2, 2),    # 1100                                              
                 (1, 3),    # 1101                                              
                 (1, 3),    # 1110                                              
                 (0, 4))    # 1111           

and now our loop could look like this:

    ht = [0, 0]
    for i in range(n//4):
        flips = random.choice(fourflips)
        ht[0] += flips[0]
        ht[1] += flips[1]

Running this code we get 0.330 microseconds per flip. About three and a half times as fast as the original code – which makes sense, because we are calling the random function one fourth as many times and that call dominates the timing.

Of course, in doing this optimization I’ve lost the data on individual flips and now I only know the relative number of heads and tails for the four flips that are grouped together. In other words, I can no longer distinguish between a group of four flips that were “HTHT” vs “HHTT” or many other combinations that will all appear as just (2, 2) when returned by random.choice. In my application this is unimportant. If it were important, I could of course have stored the individual flips in the fourflips variable, constructing it this way:

    fourflips = (
            (0, 0, 0, 0),
            (0, 0, 0, 1),
            (0, 0, 1, 0),
     ... etc

and processing it accordingly. I stuck with the aggregated counts because that’s all I cared about in my simulations.

We can take this one level further and combine the running delta method (+1 for a heads, -1 for a tails, and reconstruct the final number of heads and tails from the total number of flips and the delta), plus we can compute the delta ahead of time for each element of the fourflips list (e.g., (0, 4) is a delta of -4, (2, 2) is a delta of 0, etc). Combining those concepts we get:

    ht = [0, 0]
    fourdeltas = (4, 2, 2, 0, 2, 0, 0, -2,
                  2, 0, 0, -2, 0, -2, -2, -4)
    delta = 0
    for i in range(n//4):
        delta += random.choice(fourdeltas)

which gets us all the way down to 0.285 microseconds per flip.

How far can this optimization go? As far as you have patience for and as large as you want the list of deltas to be. I tested the same idea at five flips precomputed as deltas (i.e., five bits at a time) and got down to 0.232 microseconds per flip. There’s a diminishing return for each additional bit added – at five deltas we’re calling the random function 20% as often as the original code, but at four deltas we were already down to 25% so the improvement isn’t as dramatic. Also the deltas list doubles in size each time though of course that can be precomputed by a program so it’s not clear that there’s a pragmatic limit. I don’t know if at some point the length of the deltas list makes random.choice execute slower; I don’t think it would but it depends on implementation details.

For my purposes I was happy at five bits per call to random.choice and 0.232 microseconds per flip, which is more or less five times faster than the original.

 

Correlation vs. Causation

In case you needed any more evidence of the link between fossil fuel imports (at least, from Norway) and harm to people’s health, I present you:

Lego Build: Porsche GT3

Finally getting around to building this Porsche GT3  lego kit I received as a gift. It’s insane!

More posts later, but for now I just want to post these two pictures:

1) Modeled pistons inside an engine

2) Covered up in the final assembly

So, “you know it’s there, even if you can’t see it”. That is some serious attention to detail!

 

Later I’ll be gushing about the fact this thing has an actual, functioning, transmission in it.

 

USENET lives – sort of

I was “there” during the early days of Usenet, back when:

  • News was updated only twice a day, when our server dialed up another machine we had a mutual exchange with.
  • Business cards had bang-paths on them. For example: harvard!cfisun!palladium!nw on an old one of mine.
  • Articles were expired after just a few days because disk space was perpetually scarce.
  • The Great Renaming happened (1987) and we went from everything being “net.foo” to comp.foo, news.foo, talk.foo, etc.

I just naturally assumed the modern world had done away with usenet, so I was amused/surprised to find it (“USENET”) as the answer to this clue in today’s Wall Street Journal crossword puzzle:

I have to wonder what percentage of Wall St Journal crossword puzzle enthusiasts have ever heard of usenet, let alone ever posted on it!

P.S. There is no Cabal.

Fanless FreeBSD – Kingdel PC

Being a crusty UNIX guy, sometimes I prefer FreeBSD as a dedicated headless server instead of Linux. I recently needed a quiet (fanless) box and purchased this Kingdel box from Amazon.

Front view:

 

Rear panel:

 

It came with Windows10 pre-installed, which I promptly wiped out with a full installation of FreeBSD11.1 (amd64). There were only two tricky parts that I’m documenting here in the hopes that someone’s google search will stumble upon them if needed.

First, the BIOS was configured with only a 1 second delay for hitting the magic key (DELETE) to abort the Windows10 boot. I couldn’t remember the right key (is it always DELETE these days?) and since the delay was so short I couldn’t read the message “hit DELETE to stop boot” in the power-up screen. Google to the rescue and then “keep pressing DELETE over and over again during power up” worked.

Second, I had to fool with the BIOS settings to get it to recognize my external USB CD-ROM drive (containing the FreeBSD iso installation image). I had to change the power-on device recognition delay from “automatic” to “manual” and put in a 5 second delay, which made it work. Your mileage may vary depending on what external CD-ROM drive you have. I’m using one that is literally a decade old. It seems clear the Kingdel people (reasonably) turned all the delay knobs to the minimum values to speed bootup into the pre-installed Windows.

A note on how to make the WiFi work. The FreeBSD name for the WiFi device is iwn0. Follow the standard instructions for configuring FreeBSD WiFi, but note that they are written for the “ath” driver not the “iwn” driver (so substitute accordingly).

This means put the following into /etc/rc.conf:

wlans_iwn0="wlan0"
ifconfig_wlan0="WPA SYNCDHCP"

and create the file /etc/wpa_supplicant.conf containing (for example);

network={
   ssid="put SSID here"
   psk="put password here"
}

Your mileage may vary depending on your specific WiFi configuration requirements; in my case I tested this procedure just to make sure the WiFi adapter works (it did) but for my application my device is hardwired.

Two LAN interfaces, a WiFi interface, four serial ports, a zillion USB ports; Kingdel markets this as an “Industrail” computer (note misspelling, lmao). I’m using it to run a bunch of automation scripts and python code and the like, for which it is overkill (I had been running this on a Pi) but still silent.