Coin flips and Catalan Numbers

In a previous post I wrote about this question:

On average, how many times do you have to flip a (fair) coin to get an equal number of heads and tails?

and suggested that the answer was probably infinite (on average), although half the time you get there in just two flips: HT or TH, which are half the possible cases for two flips: HH, HT, TH, TT.

To review: there are 4 possible combinations of two flips: HH, HT, TH, TT. Each is equally likely. Two of them – HT and TH – reach equal heads/tails equality. Let’s call this “reaching delta zero” where delta refers to the difference in heads vs tails count. Since two out of four equally-likely sequences reach delta zero, there is a 50% chance of getting there in just two flips.

The next stage has four flips (three flips will never have delta zero), and there are sixteen possibilities for four flips. However, half of them were eliminated at two flips, leaving only the sequences that start with HH or TT:

HHHH, HHHT, HHTH, HHTT
TTHH, TTHT, TTTH, TTTT

Two of these sequences reach delta zero: HHTT and TTHH. So there is a 2 out of 8 ( i.e., 1 out of 4) chance of reaching delta zero at four flips, except we have to account for the 50% chance of never even getting to four flips, so there is a 1 out 8 = 12.5% chance of finishing at four flips.

In the prior post I discussed the catalan numbers and how they relate to the number of these sequences. From that discussion we can compute the probabilities of a given trial run ending (reaching delta zero) at any particular stage. This works out as follows:

# FLIPS      % reaching equality here
    2            50%
    4            12.5%
    6             6.25%
    8             3.91%
   10             2.73%
   12             2.05%
   14             1.61%

Unlike the chart shown in the previous posting, these numbers are the chance of the trial ending at the specific level (e.g., a 3.91% chance any given trial run will take exactly 8 flips, no more / no fewer).

I wrote a simulation in python to test this, and I have been running it for over three weeks. Highlights of key results:

    • Trials Completed: 3,414,318. This is how many runs were made where each run consisted of python code along these lines (except the actual code used some unrolling and optimization techniques I’ve written about  before). This code is essentially simulating flipping a coin and counting the “delta” between the number of heads and tails, so each flip either adds one or subtracts one from that running delta and equality of heads and tails is reached when the delta becomes zero.
delta = random.choice((-1, 1))
flips = 1
while delta != 0:
    delta += random.choice((-1, 1))
    flips += 1
    • Total number of coin flips: 3,197,243,199,696. Yes, you are reading that correctly, over 3 trillion simulated flips. I will point out that python, like many (most? all?) runtimes these days uses the Mersenne Twister random number generator with period 2**19937-1; no older style random number generators that have much much shorter periods would be suitable for this sort of experiment.
    • Longest run: 1,204,531,561,152. Yes, that’s another “you read that right”. The longest string of coin flips it took to reach equality was a little over 1.2 trillion. That is, after the first flip generated a delta of +1 or -1, in that sequence it wasn’t until 1.2 trillion flips later that the delta was back down to zero.
    • Runs ending at 2 flips: 1,707,887 (50.02%). This is as we predicted. experimental results error is 728 out of over 3.4M trials, which comes to a difference of about 213 ppm (or however you want to think of that difference from the perfect 50%).

Here’s a breakdown of the percentages of trials completed in N flips, and the corresponding prediction:

length (flips) actual % theoretical %
2 50.02% 50.00%
4 12.498% 12.50%
6 6.251% 6.25%
8 3.901% 3.906%
10 2.739% 2.734%
12 2.061% 2.051%
14 1.613% 1.611%
16 1.312% 1.309%

I found this write-up nicely describing the math of this and in particular analyzing that the average will indeed diverge to infinity.

The results of seemingly-simple probability questions are often counter-intuitive. “How long will it take me to get an equal number of heads and tails while flipping a coin?” and the answer is “fairly often it could take a very very long time”

Flipping a coin until reaching equal heads and tails

Today’s question is: how many coin flips, on average, would we have to do if we keep flipping a coin until the number of heads equals the number of tails?

Obviously it takes at least two, and might take many more. After two flips there are four possible outcomes:

  1. HH: Heads on flip 1, heads on flip 2.
  2. HT: Heads on flip 1, tails on flip 2.
  3. TH: Tails on flip 1, heads on flip 2.
  4. TT: Tails on flip 1, tails on flip 2.

From basic probability theory (which you should go review if necessary), each of these four outcomes is equally likely. Two out of the four of them have an equal number of heads and tails; therefore, we know that 50% of the time we try: “Flip a coin until the number of heads and tails are equal” we will be done in two flips. But the other 50% of the time we will have to keep going.

How much longer in that other 50% of the time? The first thing that should be obvious is that there will have to be an even number of flips. There will never be an equal number of heads and tails in 3 flips, or 5 flips, or any odd number of flips.

So we have to do at least 2 more flips 50% of the time we start out (i.e., whenever we start out with HH or TT).

Starting with HH the next string of possibilities is:

  1. HH HH
  2. HH HT
  3. HH TH
  4. HH TT – this one reaches equal heads/tails

Similarly, starting with TT:

  1. TT HH – this one reaches equal heads/tails.
  2. TT HT
  3. TT TH
  4. TT TT

This time there were eight possibilities, and only 2 of them reach equality. Note that even though there are four flips of the coin there aren’t 16 possibilities, because eight of the sequences start with either HT or TH and therefore reach equality of heads and tails at 2 flips. Or looked at another way, those branches of the 2**n tree (2 possibilities for 1 flip, 4 possibilities for 2 flips,  8 for 3 flips, 16 for 4 flips, etc) were pruned off.

To see this explicitly, here is the complete list of all 16 possible sequences of 4 flips:

HHHH
HHHT
HHTH
HHTT

HTHH ** Never happens; equality after HT start
HTHT **    ...
HTTH **    ...
HTTT **    ...

THHH ** Never happens; equality after TH start
THHT **    ...
THTH **    ...
THTT **    ...

TTHH
TTHT
TTTH
TTTT

So, in review:

  • 50% of the time 2 flips is enough.
  • In the remaining 50% case:
    • 25% of the time (2 cases out of 8 as shown above) 4 flips is enough.
    • 75% of the time (6 cases out of 8) more than 4 will be needed.

We can start to do some math:

Percent of time reach equality in 4 flips or fewer:
    50% in 2 flips
    25% of 50% = 12.5% in 4 flips
    75% of 50% = 37.5% takes more than 4 flips

We can now say 62.5% (50% + 12.5%) of the time it will take 4 flips or fewer (i.e., 2 or 4) to get to equality, and 37.5% of the time it will take more than 4.

How many more? Carrying this case-by-case analysis out further gets messy quickly. Fortunately this problem is related to something called a Catalan number. A description of them can be found in the Online Encyclopedia of Integer Sequences (A000108). There is also a wikipedia entry which I paraphrase here:

There are many counting problems in combinatorics whose solution is given by the Catalan numbers.

C(n) is the number of Dyck words of length 2n. A Dyck word is a string consisting of n X’s and n Y’s such that no initial segment of the string has more Y’s than X’s.

For example, the following are the Dyck words of length 6:
XXXYYY XYXXYY XYXYXY XXYYXY XXYXYY.

The way they describe that is a bit confusing, plus we are actually interested in both XXXYYY (HHHTTT) and YYYXXX (TTTHHH). Here’s another description from wikipedia:

C(n) is the number of ways to form a “mountain range” with n upstrokes and n downstrokes that all stay above a horizontal line. The mountain range interpretation is that the mountains will never go below the horizon.

and this accompanying picture:

source/attribution: Wikipedia File:Mountain_Ranges.png

This is better and more directly illustrates why the numbers we need are two times the Catalan number, because we are interested in both mountain ranges that start by going up first (e.g., first flip is heads and eventually we need to get back “down” to equal heads/tails) or by going down first (e.g., first flip is tails and eventually we need to get back “up” to equal heads/tails).

Here is a listing of all possible sequences of 2 flips, 4 flips, 6 flips, 8 flips, and 10 flips that end up with an equal number of heads and tails:

# 2 sequences of 2 flips: 2 * Catalan(n=0)
HT
TH

# 2 sequences of 4 flips: 2 * Catalan(n=1)
HHTT
TTHH

# 4 sequences of 6 flips: 2 * Catalan(n=2)
HHHTTT
HHTHTT
TTHTHH
TTTHHH

# 10 sequences of 8 flips: 2 * Catalan(n=3)
HHHHTTTT
HHHTHTTT
HHHTTHTT
HHTHHTTT
HHTHTHTT
TTHTHTHH
TTHTTHHH
TTTHHTHH
TTTHTHHH
TTTTHHHH

# 28 sequences of 10 flips: 2 * Catalan(n=4)
HHHHHTTTTT
HHHHTHTTTT
HHHHTTHTTT
HHHHTTTHTT
HHHTHHTTTT
HHHTHTHTTT
HHHTHTTHTT
HHHTTHHTTT
HHHTTHTHTT
HHTHHHTTTT
HHTHHTHTTT
HHTHHTTHTT
HHTHTHHTTT
HHTHTHTHTT
TTHTHTHTHH
TTHTHTTHHH
TTHTTHHTHH
TTHTTHTHHH
TTHTTTHHHH
TTTHHTHTHH
TTTHHTTHHH
TTTHTHHTHH
TTTHTHTHHH
TTTHTTHHHH
TTTTHHHTHH
TTTTHHTHHH
TTTTHTHHHH
TTTTTHHHHH

Because we have two variations (heads first, tails first), the number of sequences possible for a given generation (“n” value) is double the catalan number C(n). The formula for the nth Catalan number is:

# python code. Note: // is integer division
def catalan(n):
    return comb(2*n, n) // (n + 1)

where comb() is the combinatoric function “m take n”.

Given all this we can compute how many patterns of n flips will result in our experiment finishing with an equal number of heads and tails. First let’s just print out a bunch of Catalan numbers (times 2):

import math

def comb(m, n):
    if m < n: 
        raise ValueError("requires m >= n")

    top = 1
    for i in range(n):
        top *= m-i
    return top // math.factorial(n)

def catalan(n):
    return comb(2*n, n) // (n + 1)

for n in range(20):
    print("flips={:2d}, 2C(n={})={}".format(
            2*(n+1), n, 2*catalan(n)))

Running this gives:

flips= 2, 2C(n=0)=2
flips= 4, 2C(n=1)=2
flips= 6, 2C(n=2)=4
flips= 8, 2C(n=3)=10
flips=10, 2C(n=4)=28
flips=12, 2C(n=5)=84
flips=14, 2C(n=6)=264
flips=16, 2C(n=7)=858
flips=18, 2C(n=8)=2860
flips=20, 2C(n=9)=9724
flips=22, 2C(n=10)=33592
flips=24, 2C(n=11)=117572
flips=26, 2C(n=12)=416024
flips=28, 2C(n=13)=1485800
flips=30, 2C(n=14)=5348880
flips=32, 2C(n=15)=19389690
flips=34, 2C(n=16)=70715340
flips=36, 2C(n=17)=259289580
flips=38, 2C(n=18)=955277400
flips=40, 2C(n=19)=3534526380

In the output above, think of n as the generation number where each generation is two flips. Like many things in computers and math we number the generations starting at zero, so the zeroth generation has 2 flips, and there are 2*C(0) == 2 combinations (HT, TH) that reach equality at that generation. Similarly the second generation (n=1) has four flips, and two of the sequences of four flips will reach equality at that generation. We showed this explicitly above by listing all the possibilities. These Catalan numbers become large quickly, confirming that explicit case-by-case analysis is going to get quite difficult. We need to figure out some general math instead.

The next thing we need to figure out is how many of the sequences there will be in each generation. We already know that it is less than 2**flips because some of them are cut off each generation – as was shown when there were only 8 sequences to consider at four flips (instead of 2**4 which would be 16).

If there are q sequences in generation n, the number of them that have equal numbers of heads and tails is given by 2*C(n); those sequences will not continue any further. Each remaining active sequence in generation n becomes four possible sequences in generation n+1, from appending HH, HT, TH, and TT outcomes.

Therefore, if q(n) is the number of active sequences in generation n, the number of sequences in generation n+1 will be:

q(0) = 4      # by definition
q(n+1) = 4 * (q(n) - 2*C(n))

Picking that back apart: q(n) is the number of sequences there will be at generation n, and 2*C(n) is the number of those that will reach equality. Therefore we subtract 2*C(n) from q(n) and that is the number of sequences that need to go for more flips; each of those sequences will expand to four possibilities in the next generation, hence the multiplication by 4.

It’s more convenient to express that equation in terms of the current generation and the previous generation (rather than current and next):

q(0) = 4
q(n) = 4 * (q(n-1) - 2*C(n-1))

As alread noted, some of those sequences will reach equality; for example this formula gives eight sequences at n=1 (four flips). We listed those eight before, and two of them (2*C(1)) reach equality at that generation (so six of them go on to n=2 / six flips).

This is a recursive formula that doesn’t let us compute an arbitrary q(n) without going through all the cases leading up to it, but just for grins lets write some code and print out a list of q(n) values and see if we agree with it:

for n in range(10):
    if n == 0:
	q = [4]
    else:
        q.append(4 * (q[n-1] - (2*catalan(n - 1))))

# q[i] will be the computed value at generation i
print(q)

This is some quick-and-dirty code that only works if we request the q values sequentially (as is done in this example), but it’s good enough for now. Running this (with the catalan function as previous defined) results in:

[4, 8, 24, 80, 280, 1008, 3696, 13728, 51480, 194480]

This matches the results we got manually – at n=0 (the very first pair of flips) there are 4 sequences to consider: HH, HT, TH, TT. At n=1, four flips, there are eight sequences to consider and we listed them above. We know, from that list, that 2 of those eight sequences reach equality, leaving 6 “active” sequences for the next generation, and we know that each of those 6 active sequences will turn into four new sequences (from appending HH, HT, TH, and TT) … so there will be 24 sequences to consider at the next generation.

Ok, so what does this tell us about how many times we’re going to have to flip that coin, on average, to get back to equal?

We have all the tools needed to calculate the chance that you reach equality by (or before) generation n. Let E(n) be the probability that we have an equal number of heads and tails by generation n (i.e.,  in 2*(n+1) flips). We know, by definition (inspection of all the cases: HH, HT, TH, TT) that E(0) = 0.50 (50%). The probability of reaching equality by arbitrary generation n will be made up of two components:

  • the chance you reached equality by or before generation n-1, which is E(n-1)
  • the chance you had to keep going, which is whatever is left after you take out the E(n-1) chance you already finished; so that is 1 - E(n-1), which we will then multiply by the chance that you reach equality on this nth generation.

We know from the above formulas that q(n) is the number of sequences “in consideration” in generation n. We also know that the number of sequences that will reach equality in generation n is given by: 2C(n) (twice the catalan number for n).

Therefore we know the chance you reach equality in a specific nth generation is the number of such sequences that reach equality, divided by the total number of sequences that would be in consideration at that generation:

2C(n)/q(n)

You can verify that yourself with the four flip case (n=1), where there were 8 sequences (q(1) == 8) and 2 of them (C(1) == 1, 2 times that is 2) reached equality, leading to a 25% chance of being done in that case (with that case only happening 50% of the time). Putting all that together we finally get:

E(n) = E(n-1) + (1 – E(n-1)) * 2C(n)/q(n)

Again we have a recursive formula that requires us to iterate through the cases. Here’s some code:

def q(n):
    if n == 0:
	return 4
    else:
        return (4 * (q(n-1) - (2*catalan(n-1))))

def bigE(n):
    if n == 0:
	return 0.50
    else:
	em1 = bigE(n-1)
	return em1 + ((1 - em1)*((2*catalan(n)) / q(n)))

Python experts will immediately realize there are several efficiency and memory problems lurking here;  on my machine trying bigE(500) takes about 5 seconds, and for large enough value bigE will fail with an exception because the recursion gets too deep. We can fix those problems but for now this code suffices to just get a feel for some results. I ran the calculations for some values and printed out the chance of NOT reaching equality (1-E(n)) by the nth generation:

Chance equality NOT reached after    2 flips: 50.00%
Chance equality NOT reached after    4 flips: 37.50%
Chance equality NOT reached after    6 flips: 31.25%
Chance equality NOT reached after    8 flips: 27.34%
Chance equality NOT reached after   10 flips: 24.61%
Chance equality NOT reached after   20 flips: 17.62%
Chance equality NOT reached after   50 flips: 11.23%
Chance equality NOT reached after  100 flips:  7.96%
Chance equality NOT reached after  200 flips:  5.63%
Chance equality NOT reached after  400 flips:  3.99%
Chance equality NOT reached after  600 flips:  3.26%
Chance equality NOT reached after  800 flips:  2.82%
Chance equality NOT reached after 1000 flips:  2.52%

These values seem pretty counter-intuitive to me. A 2.52% chance of not reaching equality within 1000 flips is huge. It means that if you sit down periodically and try this yourself with a coin (assuming it is a fair coin), about 1 time out of 40 you will have to do 1000 flips and still would have not reached equality. I think most people when asked how frequently that would happen (especially people who misunderstand the “law of averages”) would guess it to be very rare, and certainly not as likely as a 1 in 40 shot.

I opened this posting with the question “what would be the average number of flips it takes to reach equality” but I’m still not able to answer that question. I can’t tell yet whether these formulas, with their tailing-off values as they get large, converge onto a bounded number or whether in fact the average number of flips to reach equality might be unbounded (i.e., infinite). I’m going to wrap up this post just to get it “out there” and do some more google research later. I do wonder if I may have reached the limit of my mathematical pay grade in this area.

I have written a python simulator to do some “flip until equality” experiments, but I’m not yet prepared to write it up as I want to ensure that its results are valid, and without better math support I’m not sure yet. I will say the preliminary results from the python simulator suggest that the answer to “what is the average” might be “infinity” (!!) though I am not yet certain of that. I have definitely had some astoundingly-long runs come out of the simulator so far though — the longest I’ve seen at the moment being over 380 billion flips before equality was reached (I *think* the python random number generator is suitable for this task, having a period of 2**19937, but that’s just one of many questions I have at the moment about my simulation code).

With the caveat that I’m not 100% certain my code’s results are correct, here are some preliminary results of interest from the simulation.

I’ve run 1,607,009 trials (sequences of coin flips, each starting with one flip and finishing when the number of heads and tails is equal).  Out of those trials, 804,083 reached equality in two flips. That’s 50.036% which tracks the computed/expected result correctly. Equality in four flips was reached in 200,809 trials, or 12.496% which also tracks the 12.5% prediction. The longest sequence seen was 380,172,077,974 flips — 380 billion flips!

The top ten longest flip sequences at the moment are:

380,172,077,974
 33,378,243,830
 27,175,971,118
 25,567,423,414
 22,337,890,602
 11,709,891,508
 11,678,616,684
 10,918,847,300
 10,600,505,508
  9,424,437,132

These results are fueling my suspicion that the overall average might be unbounded… i.e., that these huge sequence length numbers will occur frequently enough to drive the average towards infinity. But I’m not certain of that (can’t really be certain of it until we do more math).

Sequences of length 2 occurred 804,093 times, length 4 occurred 200,809 times, etc so if you imagined a very long list of numbers – 804,093 entries with “2”, 200,809 entries with “4”, etc, and added them all up and computed the average entry value you’d get 398,917 (obviously I didn’t do this manually, I did it by weighted-average of my recorded data points). This number might be meaningless though, if my speculation about the average being unbounded is correct (i.e., if the average will continue to increase as more trials are run). But at the moment if you ask me to answer “Dammit, Neil, what’s the average?” I guess I’d say 400K flips needed, on average, to reach equality. Again that’s a huge number and somewhat counter-intuitive; you’d think a fair coin would reach my heads/tails equality criteria in “some reasonable” amount of time. Most human intuitions about probability are wrong, and assuming my simulation is returning valid results, this appears to be yet another case of that.

Building an HTTP/POST request/response protocol

In the previous post: Python Simple Client/Server Socket Communication Module I began exploring using python’s http.server module to build a simple HTTP server as framework for a request/response protocol that could invoke a remote function and return some results to the client.

My goal is to be able to write code like this:

# ON A SERVER MACHINE
def foo(input_dictionary):
  results = ... do something with input_dictionary
  return results

server = MakeServer(port=12345, func=foo)

# ON A CLIENT MACHINE
client = MakeClient(port=12345)

input_dictionary = { something ... }
results = client.command(input_dictionary)

and essentially have a trivial remote procedure call mechanism allowing the client code to invoke the foo() function on the given input.

We’re going to build this using python’s http.server module. We are going to POST to “/” (indeed, we are going to ignore the path parameter), and the data of the post itself will be a JSON encoded python object. Rather than having to parse a “Data-length” line or anything like that ourselves, the HTTP protocol will handle that part for us; all we have to do is pluck the data length value out of the HTTP header and then read and decode the posted body ourselves.

Expanding on the do_GET example from my previous post, here’s a do_POST server function:

from http.server import HTTPServer
from http.server import BaseHTTPRequestHandler
from http.server import HTTPStatus
import json

class MyRequestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        datalen = int(self.headers['Content-Length'])
        data = self.rfile.read(datalen)
        obj = json.loads(data)
        print("Got object: {}".format(obj))
        self.send_response(HTTPStatus.OK)
        self.end_headers()

server = HTTPServer(('', 12345), MyRequestHandler)
server.serve_forever()

This simple code needs more error checking. There is also an implicit conversion between 8-bit over-the-wire bytes and python internal string representation (full unicode) hiding inside the json.loads function. In fact, if you are running a version of python older than 3.6 you might be getting an error message related to this and the distinction between a python bytes object (returned by read) and a python string (expected by json.loads). So let’s fix that before going further.

In the good old days, all computing was done in English and the 8-bit ASCII character set was good enough for everyone. All characters fit into one byte, all bytes and strings could be considered more or less as interchangeable things. Obviously, those days are long gone. Even if you say you don’t care about japanese/chinese/etc speakers and their ability to send their characters (which would be a mistake of course), even English users demand full unicode support if for no other reason than to be able to put smiley faces and other emojis into their data. Unicode 😀❤️🐳💡🎉 Happens!

Starting in python 3.x python uses Unicode as the native format for strings. The good part about this is all your code automatically will work with all of those character types. The bad part about this is you have to be cognizant of how Unicode (more than 8 bits per character) interfaces with parts of the world that operate on 8-bit bytes – like TCP streams for example. Arguably this “bad” thing is actually a “good” thing as it forces you to make your code work for everyone, even people who don’t use only ASCII.

So to see Unicode in action in python, try this:

s = '\U0001F600'
#     ^ **NOTE** that's an uppercase U
print("s = /{}/ and len(s) = {}".format(s, len(s)))

This will show you:

 

 

It’s beyond the scope of this post to explain why we usually consider “native Unicode” to be an internal representation of characters and use a different external representation when sending Unicode strings “over the wire” in a protocol. I am just pointing out that this is something we have to do – pick an encoding method and use it properly on both ends.

The most common, standard, encoding used for applications like this is called utf-8, which has the advantage that the first 128 ASCII characters (all the “good old days” characters) are still encoded the same way, as a single byte, which tends to increase interoperability with naive/old programs that are not Unicode enabled (in fact this is one of the “beyond the scope of this post” reasons to encode characters rather than send everything in its 32-bit raw Unicode glory).

So we are going to convert our internal strings into a python bytes object on one side, and back on the other. A bytes object is an iterable sequence of 8-bit integers (0 .. 255), and has a decode method for converting that sequence of bytes into a Unicode string. For example:

>>> letterA = bytes([65]).decode('utf-8')
>>> print(letterA)
A

What this is showing you is that a single byte, with value 65, when decoded using the ‘utf-8’ encoding, becomes a string of one character, an uppercase A. This is demonstrating a property of utf-8, namely that the original (“good old days”) 8-bit ASCII character values are encoded as themselves. Other characters are encoded using multiple bytes, so for example:

>>> jpnhouse = bytes([229, 174, 182]).decode('utf-8')
>>> print(jpnhouse)
家

this is demonstrating that the three-byte sequence [229, 174, 182], when decoded using ‘utf-8’, will become a single character that is (I think) the word “house” in Japanese.

We don’t really need to understand encodings other than to know the encode/decode steps are there, have to be performed, and have to use the same encoding on both sides of the wire. Starting in python version 3.6, the json.loads function will accept a bytes object and do an implicit utf-8 decoding for you. This is why the first example code given up above “works” if you are running python 3.6 or later, but will fail with a complaint about strings versus bytes on earlier versions.  I think it is better practice for us to make that step explicit, which will also have the benefit of making that example code work on older versions of python (python 3.x).

With the explicit decode step the code becomes:

class MyRequestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        datalen = int(self.headers['Content-Length'])
        data_bytes = self.rfile.read(datalen)
        data_str = data_bytes.decode('utf-8')
        obj = json.loads(data_str)
        print("Got object: {}".format(obj))
        self.send_response(HTTPStatus.OK)
        self.end_headers()

This still isn’t sending any response (other than the “OK” HTTP code) so let’s add that. This revised handler takes out the print on the server side and wraps the received object inside another dictionary {'echo': obj} and then sends that back to the client as the response data:

class MyRequestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        datalen = int(self.headers['Content-Length'])
        obj = json.loads(
            self.rfile.read(datalen).decode('utf-8'))
        rslt_str = json.dumps({'echo': obj})
        rslt_bytes = rslt_str.encode('utf-8')
        self.send_response(HTTPStatus.OK)
        self.end_headers()
        self.wfile.write(rslt_bytes)

This works – but before putting it into production it should probably be enhanced in several ways. It leaves out several HTTP headers in the response; we should probably fill in Content-Type (‘application/json’ would be appropriate) and Content-Length. It turns out Content-Length isn’t “needed” because the default response format is HTTP/1.0 which defines the length by closing the stream at the end. But we might want to use HTTP/1.1 in the response format and include a Content-Length (and thus also allow for persistent connections which were not supported under the HTTP/1.0 format). All of these elaborations are left as an exercise for the reader at this point, in consultation with the http.server module documentation. I made many of these improvements in the code that I will also be posting on github (TBD at the time of writing this posting).

With this primitive server we have enough framework to use curl to fire commands at a server and have it invoke some function (in this case hardwired into “encapsulate the object and return it”) and return results to the client. This works really well with not very many lines of code!

Let’s build an explicit client instead of using curl (although being able to use curl does demonstrate one of the advantages of picking a standard transport protocol such as HTTP/POST). Here is a bare bones test request function:

from http.client import HTTPConnection
import json

def testrq(obj):
    c = HTTPConnection('localhost', 12345)
    c.connect()
    encoded = json.dumps(obj).encode('utf-8')
    c.request("POST", "/", body=encoded, headers={})
    response_bytes = c.getresponse().read()
    response_string = response_bytes.decode('utf-8')
    return json.loads(response_string)

This connects (to hardwired localhost:12345) and sends whatever object (obj) you provide, gets the response (see the http.client / HTTPConnection documentation) and decodes it as a JSON object. Obviously all semblance of generalization and error checking has been omitted here. But this works.

Now that we know how to send arbitrary python objects back and forth from client and server we can work on building a real, but still simple, generalized framework for all this. That will be the next post.

 

Python Simple Client/Server Socket Communication Module

I wanted a python module with a simple client/server request/response protocol … something that would let me invoke a function remotely, with code looking something like:

def server_function(input_dictionary):
  ... do something with the input_dictionary ...
  return {'result': 'blah blah blah'}
  
def client():
  server = ... connect to the server ...
  request = { some dictionary of stuff here }
  result = server.command(request)
  ... result is a dictionary as returned by the server

The client would formulate a request as a dictionary, send it to the server, the server processes it, formulates a dictionary as a reply, and returns that to the client. In effect this is a simplified form of a general remote procedure call system.

Google to the rescue? Yes, there are any number of github repositories and official modules out there that sort of do this. However, I had three problems with all the ones I found:

  • They had bugs in them regarding TCP stream semantics, or, perhaps, if not outright bugs they were written with limitations that wouldn’t generalize to transferring large amounts of data in a single request/response pair.
  • They worked at a byte stream level of abstraction, and I wanted a higher-level “message” or “packet” abstraction (more like the above “send a dictionary, get a dictionary” model).
  • OR … they were huge frameworks, that were very powerful but felt like overkill for my application. I didn’t need to set up an entire REST API, I didn’t need the features of a “real” application server, etc. Of course this is dangerous thinking, in that anything that starts out as a trivial 100-line hack might someday grow into something real. Nevertheless, I decided to proceed with a small implementation of something for my specific purpose. Though, as we’ll see, I did conclude that HTTP/POST was a suitable transport mechanism and ended up implemented what can only be thought of as a completely degenerate (one URL) imitation of a REST API. It’s up to you whether to think of what I’ve done here as something useful, perhaps for limited applications or at least as a learning environment, or a complete waste of time.

On the bugs front, let’s take a look at this code from 21.21.4.1 in the python version 3.6 library documentation for the socketserver module. Here’s the relevant part of their server example:

# self.request is the TCP socket connected to the client
data = self.request.recv(1024).strip()
# just send back the same data, but upper-cased
self.request.sendall(data.upper())

Leveraging the surrounding socketserver framework, this code receives a string over a socket, converts it to upper case, and sends that back as the response. It’s not hard to imagine generalizing this to where perhaps it is passing JSON-serialized python dictionaries (or other arbitrary serializable objects) – voila! Just what I was looking for.

But there’s a problem: TCP is a stream oriented protocol and not a “message oriented” protocol. There are potential bugs lurking in some subtle assumptions in the above code.

What happens if you want to send a 4000 byte string? Let’s try it. I modified the above code to say “4096” in the recv call and similarly modified the client. In fact, the python socket module documentation recommends 4096 as a reasonable value to specify in calls to recv():

Note For best match with hardware and network
realities, the value of bufsize should be a relatively
small power of 2, for example, 4096.

So with that in mind here are the relevant excerpts of the two sides modified to send 4000 bytes (and using a 4096 byte recv() buffer):

# Server, modified to read up to 4096 bytes
    def handle(self):
        data = self.request.recv(4096).strip()
        self.request.sendall(data.upper())
# client, modified to send/recv up to 4096 bytes
# and report the amount of data sent/received
with socket.socket(socket.AF_INET,
                   socket.SOCK_STREAM) as sock:
    data = "0123456789" * 400        # 4000 bytes total
    sock.connect((HOST, PORT))
    sock.sendall(bytes(data, "utf-8"))

    received = str(sock.recv(4096), "utf-8")
    print(
      "Bytes sent {}, bytes received {}".format(
         len(data), len(received)))

I ran the client in a loop and got output like this:

Bytes sent 4000, bytes received 4000
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 4000
Bytes sent 4000, bytes received 1448
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 2896

In fact it’s worse than this. If you put a print(len(data)) statement in the server to see what it thinks it is getting, you will find that sometimes the mismatch is on the server side (i.e., the server thinks the client sent less than 4000 bytes), and sometimes it is on the client side (i.e., the server got all the bytes, but the client thinks it got fewer bytes in the reply), and sometimes it is on both.

What’s going on???

TCP is a byte-stream protocol. It reliably delivers the bytes you give to it, and it delivers them in order. What it does NOT do is preserve any notion of message boundaries within that byte stream. We might make one call to sendall() with 4000 bytes, but that does not guarantee that a single recv() on the other side will get all 4000 bytes at once.

When TCP packets are sent “over the wire” (between two distinct machines, via some form of underlying network technology) they will be broken up into segments that have a maximum size, called, unsurprisingly, the maximum segment size, and abbreviated as MSS. The specific MSS value varies network by network, sometimes because of network technology differences, sometimes because of other factors.

If you run the above client and server on a single machine then there is no actual network between the two endpoints. The communication happens over TCP, but entirely within the confines of one machine. In that case it is likely that the code will work perfectly; 4000 bytes will be sent/received consistently on both sides (however this depends on operating system implementation details).

But if you split the client and server across two distinct machines, bringing real network technology (and limitations) into play, you will see results like those I got above. You might even see different results depending on whether your client and server are separated by a wired network, a WiFi network, a cellular network, and so forth, and whether it is a local network (from machine A to machine B in your own house, for example) or a true WAN (from machine A in your house to perhaps machine B running as an AWS cloud instance).

The fact that this simple-minded code “works” when you try it just on your own machine and doesn’t show any bugs until deployed on a larger network is a common source of misunderstanding – leading to hard-to-find bugs.

If the network can only send 1448 bytes in one “segment”, then sending 4000 bytes becomes (at least) a total of three TCP segments over the wire. They can be sent very quickly, so in many cases all three will arrive so fast that the other side will see all the data at once and will receive the full 4000 bytes. But this won’t always happen; sometimes the first segment will arrive and perhaps the second segment is delayed just long enough for the other side to not see it before the recv() call thinks it is finished. And that’s how we end up seeing only 1448 bytes transmitted sometimes, because the recv() call completes its work before the second chunk of data arrives at the destination machine.

Aside: sendall is a python-level convenience function and not an operating system (socket API) primitive. We don’t know whether sendall makes a single call to the underlying socket API to send data or breaks your data down into chunks and loops over multiple OS calls. However, it doesn’t matter. This segmentation of data sent over TCP will happen no matter what any time the MSS is smaller than the total amount of data you are trying to send.  

Emphasizing again: TCP is a byte-stream protocol, not a message protocol. If we want to send and receive structured messages, we have to layer our own application-layer protocol on top of the byte stream provided by TCP. So instead of just sending a JSON-encoded dictionary from client to server, which could easily be 4000 bytes (or more!) and trigger the problems we’re seeing above, we have to put some “decorations” on our data so that each side can parse it and structure it accordingly.

It’s common (and has been for decades) to define protocols to do this using a line-oriented format of headers and data. So, for example, instead of just sending JSON data we can send something that looks like this:

Data-length: 37
["this is a JSON list of one string"]

In this case we have a simple header “Data-length: 37” followed by (if wordpress hasn’t mangled it too badly) exactly 37 characters which are a JSON presentation of a list containing one string.

Now instead of just calling recv() and trying to read the entire message, we would read data line-by-line, parse the fields as they come in, and we can loop over multiple recv() calls if necessary because the “Data-length” header tells us how many bytes to expect after that.

If you are paying attention, you’ll realize there are still problems lurking here. Even with this “send a line telling us how long the data will be” protocol, it is still conceptually possible that even that line, no matter how short it is, might be broken up into multiple TCP segments. In other words, just because this is short:

Data-length: 37

doesn’t mean we will always get it in one read() operation when we are using a TCP stream. Fortunately python (as do many other languages) provides io libraries that essentially read an input stream character by character. That low level code has routines such as “readline” that go character-by-character (waiting for additional input from the TCP stream as necessary) building up a string into a “line” until a newline character is reached. This means we can write our code in terms of “read one line” and not worry about TCP segmentation; the newline character in effect becomes an in-band message boundary character that we can parse out and reconstruct the line abstraction from the raw TCP byte stream.

So we could go off and (carefully) write a bunch of code to implement this sort of protocol on top of TCP, and in fact I’ve done that as an exercise, but pretty soon it becomes apparent that we are in essence re-inventing something that already exists… HTTP requests, and especially HTTP POST requests of (perhaps) JSON encoded application data.

At which point “just go install one of those big REST frameworks” is a plausible answer to my question. Nevertheless, I decided to try to implement a smaller example of HTTP/POST transport for this if for no other reason than as an interesting learning exercise.

In fact python has some modules that make implementing a special purpose HTTP server pretty simple. The http.server module includes an HTTPServer class that allows you to pretty quickly set up a trivial server. Here is a server that just prints out what path name it received and returns an HTTP OK response (with no other data) the client:

from http.server import HTTPServer
from http.server import BaseHTTPRequestHandler
from http.server import HTTPStatus

class MyRequestHandler(BaseHTTPRequestHandler):
  def do_GET(self):
    print("GET request on {}".format(self.path))
    self.send_response(HTTPStatus.OK)
    self.end_headers()

server = HTTPServer(('', 12345), MyRequestHandler)
server.serve_forever()

Running this establishes an HTTP server on port 12345 and you can play with it by sending it GET requests using something like curl:

In window A:

% python3 theabovecode.py

In window B:

% curl localhost:12345/testpath

and you’ll see output indicating that the server received a request for path “/testpath”

In my next post I’ll write up how to use this framework to process an HTTP POST request of JSON-encoded data to/from a trivial server.

Random Python Performance Posting

I’ve been writing some very elementary probability simulations in python and wanted to simulate millions of coin flips. I started out with this somewhat-obvious code:

    ht = [0, 0]
    for i in range(n):
        ht[random.choice((0, 1))] += 1

which counts heads/tails in ht[0] and ht[1] and performs n flips.

Can this code be sped up? Inserting time.time() calls before/after the “for” loop I found that this code can perform 10 million flips in 11.8 seconds. So it’s about 1.18 microseconds per flip.

[ aside: all measurements were made on my MacBook Air. I tried to avoid doing anything else while running the timing tests, and in all cases took the fastest result I could get. All results for a given method were similar and were run multiple times at multiple different points in time, so I believe there were no confounding factors with background activities. Your mileage may vary. ]

Borrowing a trick from the old geezer days, I thought “surely if we unroll the loop this will go faster”.  So I changed it to “for i in range(n//100)” and then put 100 of the “ht[random.choice(…” statements in the body of the loop.

Result? 1.16 microseconds per flip. Ok, that’s faster, but by less than 1%. Kudos to python for having efficient looping, plus, of course, the random() call dominates the cost anyway.

Ok maybe random.choice is slower than explicitly asking for a random number:

    ht = [0, 0]
    for i in range(n):
	ht[random.randrange(2)] += 1

Nope, that came in significantly slower! 1.33 microseconds per flip.

Maybe I could speed it up by eliminating array access and just adding up +1 and -1 values in a loop and reconstructing the heads vs. tails numbers after:

    delta = 0
    for i in range(n):
	delta += random.choice((-1, 1))  

That ran as fast as 1.14 microseconds per flip, but still just an insignificant improvement over the the original and arguably more “obvious” code.

Of course, by this point readers are screaming at me “duh, all the time is in the random function and there’s nothing you can do about it” and they are right. I should have tested that first. For example, this code:

    ht = [0, 0]
    for i in range(n):
        ht[1] += 1

runs at 0.11 microseconds per “flip” (albeit there is no flip, it just always adds to ht[1] in this example). Or if we go all the way to the delta method, it will be as little as 0.06 microseconds per “flip” for this:

    delta = 0
    for i in range(n):
        delta += 1

obviously there is no “flip” here but it’s fast.

Of course, we can use a single call to random to generate more than one random bit at a time. In C we might generate a 32-bit random number and then peel off the bits one by one. I’m guessing that doing this in python won’t be as efficient, but there’s still a way to easily amortize one call to the random method over multiple flips. For example we can enumerate all sixteen possibilities for four flips. In binary, they would be: 0000, 0001, 0010, 0011, … 1100 1101 1110 1111. We can note that the first (“0000”) is 4 heads (arbitrarily calling the 0 heads, it doesn’t matter which) and 0 tails. The next one (“0001”) is 3 heads and 1 tails, and so on. The complete enumeration of the outcomes looks like this:

    fourflips = ((4, 0),    # 0000                                              
                 (3, 1),    # 0001                                              
                 (3, 1),    # 0010                                              
                 (2, 2),    # 0011                                              
                 (3, 1),    # 0100                                              
                 (2, 2),    # 0101                                              
                 (2, 2),    # 0110                                              
                 (1, 3),    # 0111                                              
                 (3, 1),    # 1000                                              
                 (2, 2),    # 1001                                              
                 (2, 2),    # 1010                                              
                 (1, 3),    # 1011                                              
                 (2, 2),    # 1100                                              
                 (1, 3),    # 1101                                              
                 (1, 3),    # 1110                                              
                 (0, 4))    # 1111           

and now our loop could look like this:

    ht = [0, 0]
    for i in range(n//4):
        flips = random.choice(fourflips)
        ht[0] += flips[0]
        ht[1] += flips[1]

Running this code we get 0.330 microseconds per flip. About three and a half times as fast as the original code – which makes sense, because we are calling the random function one fourth as many times and that call dominates the timing.

Of course, in doing this optimization I’ve lost the data on individual flips and now I only know the relative number of heads and tails for the four flips that are grouped together. In other words, I can no longer distinguish between a group of four flips that were “HTHT” vs “HHTT” or many other combinations that will all appear as just (2, 2) when returned by random.choice. In my application this is unimportant. If it were important, I could of course have stored the individual flips in the fourflips variable, constructing it this way:

    fourflips = (
            (0, 0, 0, 0),
            (0, 0, 0, 1),
            (0, 0, 1, 0),
     ... etc

and processing it accordingly. I stuck with the aggregated counts because that’s all I cared about in my simulations.

We can take this one level further and combine the running delta method (+1 for a heads, -1 for a tails, and reconstruct the final number of heads and tails from the total number of flips and the delta), plus we can compute the delta ahead of time for each element of the fourflips list (e.g., (0, 4) is a delta of -4, (2, 2) is a delta of 0, etc). Combining those concepts we get:

    ht = [0, 0]
    fourdeltas = (4, 2, 2, 0, 2, 0, 0, -2,
                  2, 0, 0, -2, 0, -2, -2, -4)
    delta = 0
    for i in range(n//4):
        delta += random.choice(fourdeltas)

which gets us all the way down to 0.285 microseconds per flip.

How far can this optimization go? As far as you have patience for and as large as you want the list of deltas to be. I tested the same idea at five flips precomputed as deltas (i.e., five bits at a time) and got down to 0.232 microseconds per flip. There’s a diminishing return for each additional bit added – at five deltas we’re calling the random function 20% as often as the original code, but at four deltas we were already down to 25% so the improvement isn’t as dramatic. Also the deltas list doubles in size each time though of course that can be precomputed by a program so it’s not clear that there’s a pragmatic limit. I don’t know if at some point the length of the deltas list makes random.choice execute slower; I don’t think it would but it depends on implementation details.

For my purposes I was happy at five bits per call to random.choice and 0.232 microseconds per flip, which is more or less five times faster than the original.

 

Correlation vs. Causation

In case you needed any more evidence of the link between fossil fuel imports (at least, from Norway) and harm to people’s health, I present you: