## Voronoi Diagram Problems on a Grid

Following up on my previous post about Voronoi diagrams on finite grids.

By a “finite grid” I mean something like this that you might write for a game:

```PSEUDO-CODE (not in any particular language syntax)
# create a 100 x 100 map of the world
for x in 0 .. 99
for y in 0 .. 99
grid[x, y] = SEA

# now make some LAND
... (some algorithm goes here) ...
```

I wrote some code to create Voronoi cells from random seeds in such a grid, and ran into some odd situations as described in that previous post.

Today I figured out that it is impossible to create a finite-grid Voronoi diagram that is “correct” in the sense that it meets both of these criteria:

VORONOI CORRECTNESS CRITERIA:

1. All grid squares are assigned to their closest Voronoi seed.
2. All Voronoi cells are contiguous

Criterion #1 simply means that each grid square is assigned to the “correct” Voronoi cell. This seems pretty basic and obvious.

Criterion #2 simply means that each Voronoi cell is one contiguous “blob” of points, that it doesn’t have any isolated points elsewhere in the map that are not connected to it.

As it turns out you can’t guarantee that both of those will be true. There are situations where you will have to choose either between violating #1 and assigning a grid square to the “wrong” Voronoi cell, or violating #2 and assigning a grid square to the “correct” Voronoi cell even though it does not have any immediate neighbors (including diagonally) that are in that cell.

This surprised me until I understood it; then I realized it was obvious. It is an “obvious” two-dimensional implication of sampling and aliasing problems that are to be expected when you take a continuous domain (the actual Voronoi diagram) and sample it into a digital (discrete) domain. High frequency components will introduce aliasing errors in this situation, and in this case the definition of a “high frequency component” amounts to Voronoi cells that are “too pointy.”

Here’s an example you can plug into any Voronoi generation code:

```Grid size: 25 x 25
Voronoi sites: (0, 6), (4, 3), (0, 12)```

I plugged these into the Matlab Voronoi code, which computes a continuous solution and got this output:

The three Voronoi sites are shown: A=(0,6), B=(4,3), C=(12,0) and the entirety of the diagram is divided into three Voronoi cells accordingly.

Let’s zoom in where the three cells meet:

I’ve put coordinates on this zoom-in that would correspond to a finite grid using 1 unit squares, but in the drawing above they are subdivided into quarter squares that are each a half unit in size (in each of x and y). Thus each of our “integer grid” squares consists of four of these half-unit squares, centered on its coordinate. This diagram which zooms in even more illustrates that:

And now we can clearly see the dilemma. Look at the four colored grid-squares:

• RED: Centered on (15, 23)
• ORANGE: Centered on (16, 23)
• YELLOW: Centered on (15, 22)
• GREEN: Centered on (16, 22)

The Voronoi computation for each of these squares, computed at their center point (because, by definition, this is how we compute them in a finite grid), will assign them as follows:

• RED: A
• ORANGE: B
• YELLOW: A
• GREEN: C

You can run these calculations yourself if you want to verify what seems obvious from the picture. For example, for the YELLOW square at (15, 22) those calculations look like this:

```(P.x, P.y) = (15, 22)
Distance (squared) to A=(0, 6):
deltaX**2 = (15 - 0)**2 = 225
deltaY**2 = (22 - 6)**2 = 256
Distance (squared) to site A = 481
.. using same method ..
Distance (squared) to site B = 482
Distance (squared) to site C = 493```

If we run these calculations on all the grid squares (1-unit by 1-unit squares centered on the integer coordinates in that picture) we’d get this diagram for assigning the squares to cells:

For the 3 x 3 grid worked out in this diagram (remember: grid squares are centered on their integer coordinates):

• Five yellow squares are all assigned to Voronoi A.
• Two green squares are assigned to Voronoi C.
• Two purple squares, which do not touch each other, are assigned to Voronoi B.

There is no avoiding this. If we zoomed back out we’d see a large contiguous yellow cell for Voronoi A, a large contiguous green cell for Voronoi C, and a large, discontiguous, purple cell for Voronoi B.

The issue is rooted in the resolution (1 unit, in this example) of the finite grid we have imposed on top of a continuous-domain Voronoi diagram. Voronoi cell B is simply “too pointy” — in effect it has frequency components that are higher than the Nyquist rate. Or another way of thinking about it is that it is pointy enough so that it can reach the coordinate (16, 23) while dodging all of the immediately-neighboring (integer grid) coordinates.

There is good news and bad news here. The bad news is – there is no way to ensure that a Voronoi diagram computed on a finite-resolution grid will be “correct” if by “correct” we mean: obeys the VORONOI CORRECTNESS CRITERIA I listed up top.

The “good” news, if you think of it this way, is that this means that any approximation that runs efficiently and gets things right within reason can be considered “good enough” because fundamentally there is no one right way to do this computation anyway.

So, finally, bringing this back to the problem that sent me down this rabbit hole (in the midst of the enclosing Voronoi rabbit hole): My “expanding rings” algorithm faces a dilemma when trying to decide which Voronoi cell to choose for a grid square when two cells are equidistant from that grid square. And as I wrote in the other post, I was concerned that making the “wrong” choice might cut off the perimeter of an active Voronoi cell prematurely, leading to possible mischaracterizations of later cells. And I started searching for smart ways to determine what the “right” choice for those instances would be, to avoid causing wrong Voronoi assignments later.

That was turning out to be difficult as my random test generator kept finding cases that didn’t work out no matter what additional algorithm I tried for assigning the equidistant cases. Now I know that the underlying problem for at least some of those random test cases that got generated is this problem described here: sometimes there literally is no choice that is “correct” (that meets both of those correctness criteria laid out).

So, we’re back to the easy answer: “Who cares!” ðŸ™‚

## Voronoi Diagrams

Day 2 of COVID-19 social distancing. All of my meetings and travel for the upcoming week are cancelled, my calendar is now strangely empty, and I could probably make use of this extra time to do something wildly productive.

Instead, I have randomly dived into this particular rabbit hole:

https://en.wikipedia.org/wiki/Voronoi_diagram

Read the Wiki page for the real details, but in summary: take a set of points called “seeds” (or sites). For each seed, the corresponding Voronoi cell is the set of points that are closer to that seed than to any of the other seeds. That set will be a polygon around the seed (and every point in that polygon will be closer to its own seed than to the seed of any other polygon).

An example Voronoi diagram:

Let’s say we are writing a map generator for a “world consisting of land and sea grid squares” game. We want to generate random blobs of land and sea that seem plausibly arranged. Voronoi diagrams are one way to do that – create some random Voronoi cells and choose a subset of them to be land and the rest to be sea. You can imagine how that would work using the above picture and visualizing some of those polygons as being land and the rest sea. In practice you would probably want more individual cells (i.e., finer-grained), so that as they clump together when randomly selected they look less obviously like the Voronoi polygons that they are. You might also still need to do additional post-processing on the results; you can go down another rabbit hole googling this topic. The Voronoi idea just gets you close as a starting point. But I digress… so: back to the Voronoi computation.

The “gold standard” algorithm for generating a Voronoi diagram appears to be Fortune’s algorithm.

https://en.wikipedia.org/wiki/Fortune%27s_algorithm

And of course the right answer to most questions of “how to implement something in python” is “google it; because five people have probably already done it” and that is definitely the case here.

But there’s time to kill and this seems like a fun project so I decided to play with these diagrams. I did not (or have not, yet) tried implementing Fortune’s algorithm. It is complicated and requires study. Instead I plowed ahead with the somewhat-obvious “brute force” approach:

```Let G be a grid
Each G[x, y] being a point a.k.a "grid square"

Let S be a list of points
Each S[i] being the (x, y) point for the "seed" of a cell

For each point, pt, in grid G:
For each site, s, in list S:
compute distance between pt and s
assign pt to the closest Voronoi site```

This works. It is not fast.

This brute-force algorithm has order nÂ² performance characteristics, where n is the number of Voronoi cells. That analysis depends on exactly how you define n and how you think about the parameters, in particular how many Voronoi sites will be seeded into a grid of a given size.

In the “world map” example, this analysis definitely holds because if you choose a larger map you will still want the same average Voronoi polygon size, just more of them. The alternative – same number of Voronoi cells, just let them grow larger as the map grows larger, makes little sense in this example. So: Bigger grid, more sites. Thus, a larger grid means both more points to evaluate, and more Voronoi sites to evaluate them against, leading to the nÂ² performance characteristic, which gets slow very quickly as the grids get larger.

## Aliasing

As I started coding (and testing!) I discovered some interesting problems I haven’t seen written up anywhere (though perhaps my google-fu was just weak).

In a quantized grid of discrete x, y (integer) points, there are configurations where a given grid point could be a member of two different Voronoi cells. Here’s a very simple/obvious example of that:

```  2     ?   b   B
1     a   ?   b
y=0     A   a   ?

x=0   1    2```

In this diagram, upper-case A‘ at coordinate (0, 0) is one Voronoi site, and upper-case ‘B‘ at coordinate (2, 2) is another. The two lower-case ‘a‘ points are at (Euclidean) distance 1 from A and clearly belong in its Voronoi, as the ‘b‘ belong to B

But the three points marked with ‘?‘ are equidistant from both A and B. How should they be assigned?

```a b B      b b B      a b B
a a b  OR  a b b  OR  a b b OR ...
A a a      A a b      A a a```

“Who cares?” is a perfectly reasonable response to this question. But it came up for me when I wrote an alternate implementation and made test cases to compare the results of the two, which did not compare equal in simple cases like these but also in some surprisingly complicated cases. This prompted me to study the issue a bit deeper.

Fortune’s algorithm sidesteps the issue because it solves for the vertices of polygons. There is an exact solution for those equations, independent of any grid considerations. In the above examples each of the two Voronoi cells are triangles, with their vertices at the obvious corners. The vertices of the polygons are lines with continuous slope/intercept equations, and they uniquely and precisely define the Voronoi diagram. It is only when we take that continuous domain and map it (“sample” it) onto a grid that we get the aliasing (“jaggies”) problem. You can take yourself down the Nyquist-Shannon sampling theorem google rabbit hole yourselves if you want to at this point. ðŸ™‚

So, for example:

This depicts the true Voronoi polygons, each with three vertices (color-coded circles, shared vertices colored half-and-half). It also shows the integer grid imposed on top of this continuous figure. This is just another way of showing the aliasing problem caused by sampling the continuous Voronoi polygon solution onto the finite (blue) grid.

These trivial examples don’t seem very important, but when I tried writing a better/faster but still “brute force” algorithm I ran into more problems.

## Growing Circles Algorithm

The next idea is a step up from brute-force: grow each Voronoi cell outward cell-by-cell, doing them all “in parallel” (conceptually, via algorithm logic not “real” parallelism). The idea is that grid points near their “obvious” Voronoi seed can be assigned to that cell without needing to be checked against other cells that are obviously too far away.

Let Dsq(P1, P2) (“squared distance”) between two points P1 and P2 be defined as:

```dX = P1.x - P2.x
dY = P1.y - P2.y
Dsq = (dX*dX) + (dY*dY)```

This is just a standard Euclidean distance calculation with the final sqrt step omitted. That step is unnecessary for this algorithm, so I leave it out (faster/simpler).

Next define a “ring” Rs(N) of squares around a Voronoi site, defined as the set of points that are squared-distance N away from Voronoi site S.

This diagram shows two Voronoi cells and their surrounding grid points marked with the Rs values. There is a separate Rs function for each Voronoi site; in the diagram the results are color-coded to match the corresponding sites.

For any given grid point, the lower Rs value is the one that matters, and assigns that point to the corresponding Voronoi site. The diagram shows higher values in (ironically) smaller font for some of the points where the cells begin to touch.

Starting with N=1 the algorithm builds out these “rings” (they are fragments of rings) from each Voronoi site. Look at the grid point at the top directly above B, labeled 9 in green (Rb N-value) and 10 in red (Ra N-value). When we create the Rs(9) rings for all the Voronoi sites, that point will get labeled with a 9 for the B Rs(N) function. Even though we can see later it will be right at the boundary of A and B, since we have discovered it while working on all the 9’s (and below), we can safely assign it to B, because we know by definition that if no other Voronoi site has this in its list of 9 (or below) then the closest it could be to any other site is 10 or more. We can “claim” it for green in this example, without ever having computed how far away it is from red.

Each time a point is “claimed”, the unclaimed perimeter squares of that point become candidates to be evaluated at the next N value. Said differently, each Voronoi site has an “active perimeter” which are all the unclaimed points immediately neighboring the points that have been claimed in prior iterations. It is entirely possible for a single point to end up in the active perimeter of more than one Voronoi cell at a time, which will happen as the growing blobs around each cell start to approach each other. Whichever cell gets there first (in terms of the squared-distance N value being iterated) will win that point.

Thus the key idea here being that for most points they never have to be evaluated against more than one, or maybe just a few (more on this in a bit) Voronoi sites. Instead of computing the square distance from every point to every Voronoi cell, we only compute the distances for the active perimeters of each Voronoi cell separately, and let them grow outwards unmolested until they begin to touch either other. Before they touch each other there is no chance that some other Voronoi cell would be closer; if it were, then the point in question would already be in its active perimeter.

The simplest way to understand this algorithm is to imagine N going up by one each iteration, but there are ways to optimize that to go up to the “next” N value that matters. Perhaps more on that later.

Of course there’s still the same equidistant point problem. Equidistant points show up any time two Voronoi cells are directly on a horizontal, vertical, or diagonal line with each other AND there is an odd number of points along that line between them. The simple 3×3 red/green example makes this somewhat obvious, as the “middle” grid points have to be divided in half between the two competing Voronoi cells. When there are an even number of grid points along one of these “straight” lines, the red/green border falls between two points so there is no ambiguity. It only falls into the middle of a grid point when the total number of grid points along this connecting line is odd.

However, there are other combinations that lead to equidistant points. The fact that each Dsq value computed is (by definition) the sum of two squares means we can look for this programmatically, by finding integer values that can be the sum of two (or more) pairs of squares, that is:

`N = (a*a) + (b*b) = (c*c) + (d*d)`

for some combination of integers a, b, c, and d, excluding the trivial case where the pair (a, b) is identical to (c, d) or (d, c). So what are some examples of these numbers? Well, as with most of those types of questions, the Online Encyclopedia of Integer Sequences has the answer, or at least part of it: Sequence A007692:

http://oeis.org/A007692

and the first few numbers that meet this criterion are 50, 65, 85, 125, 130, 145; however, I say “part of” the answer because that sequence excludes the case where one of the terms is allowed to be zero. I couldn’t find an actual OEIS sequence corresponding to “allow zeros” in the above sequence, but any pythagorean triple will form one of these ambiguous contenders with one of the terms being allowed to be zero, because:

```If a**2 + b**2 == c**2 then
a**2 + b**2 == c**2 + (0)**2

and (a, b) , (c, 0) will be a pair of two different pairs
whose squares will sum to the same integer (c**2)```

So for pythagorean triple 3, 4, 5 we see that 25 is another such value, with the corresponding pairs being (3, 4) and (5, 0).

Let’s take a look at 50: it is 25+25 and also 49+1. That means it should happen on a grid point that is deltaX=5, deltaY=5 from one Voronoi cell, and deltaX=7, deltaY=1 from another. Drawing that, with the square-distance values, results in:

This illustrates three different types of equidistant grid points. The one labeled 10 (in both red and green) is at delta (1, 3) from A and delta (3, 1) from B. This is the simplest scenario for creating equidistant squares – when they have the same “shape” of their distance from their corresponding Voronoi site.

The one labeled 25 is a pythagorean example; it is (3, 4) from A and (5, 0) from B.

Finally, Square F, which has square-distance of 50 to both A and B, is an example of something else entirely – it is (5, 5) from A and (7, 1) from B.

Things get weird (and very hard to draw) as the numbers get larger. I ran a test case with these parameters:

```# three Voronoi sites
S = [ (19, 0), (13, 15), (18, 7) ]```

and then computed the squared-distance for every point in a 20×16 grid. The result for the lower left (starting at 0, 0) region looks like this stopping at N=324:

```  269b 244b 221b 200b 181b 164b 148c 125c
290b 265b 242b 221b 202b 178c 153c 130c
313b 288b 265b 241c 212c 185c 160c 137c
.  313b 281c 250c 221c 194c 169c 146c
.    .  290a 257a 226a 197a 170a 145a
.  324a 289a 256a 225a 196a 169a 144a ```

The lower case letters ‘a’, ‘b’, and ‘c’ correspond to the three Voronoi cells at A=(19, 0), B=(13, 15) and C=(18, 7). Their origins in this table are off to the right (and up); here is the full chart in smaller font:

The square at (1, 1), which hasn’t been computed yet in the above, has squared-distance value 325 from both A and C:

```A=(19, 0), C=(18, 7), P=(1, 1)
dxA = (19 - 1) = 18
dyA = (0 - 1) -1
Na = dxA**2 + dyA**2 = 325

dxC = (18 - 1) = 17
dyC = (7 - 1) = 6
Nc = dxC**2 + dyC**2 = 325```

So as already belabored to death, the point at (1, 1) has to be randomly assigned to A or C and it “doesn’t matter”. Or so we think.

Say we assign the point to A. Notice that this “cuts off” C, so there are no more active perimeter points for C after choosing to give this point to A.

Now let the calculation run up to 359 and look again, after assigning that point to A:

```  269b 244b 221b 200b 181b 164b 148c 125c
290b 265b 242b 221b 202b 178c 153c 130c
313b 288b 265b 241c 212c 185c 160c 137c
338b 313b 281c 250c 221c 194c 169c 146c
.  325a 290a 257a 226a 197a 170a 145a
.  324a 289a 256a 225a 196a 169a 144a ```

Now the problem: the point at (0, 1) is distance 362 from A but is 360 from … (wait for it) … C!

Oooops, we aren’t even checking C any more because it got “cut off” when we arbitrarily chose to assign (1, 1) to A.

I stumbled into this problem by writing test code to generate random Voronoi sites, then running my “grow the blobs” algorithm and testing it against the brute-force implementation, with allowance as necessary for squares that might be equidistant from two different Voronoi cells. The test code very quickly generated test cases like this one where the arbitrary choice for an earlier equidistant point turned out to have implications on later choices and could cause a point to be assigned to the wrong Voronoi.

I spent quite a bit of time trying to understand this problem to see if there was any efficient way to fix it. The first thing that comes to mind is a heuristic that says “never close off the perimeter of a Voronoi cell if you don’t have to” which, in this particular case, would argue for keeping C alive and cause the correct result to come about.

I implemented that heuristic and it fixes this case, but it does not seem to fix the general case and I haven’t done enough digging into it yet to figure out why. I then tried generalizing the heuristic to assign the equidistant point to the Voronoi cell with the fewest active perimeter squares in the 3×3 neighborhood. This too does not address all cases (nor did a 5×5 neighborhood). I’m not sure if there is a generalized (and practical) algorithm that will solve for this or not yet.

I note that even with the original pure brute-force algorithm, there are bugs here depending on how you look at it. The brute force algorithm faces the same decision regarding the (1, 1) “325” point – assign it to A, or C? If the brute force algorithm assigns it to A, it will at least still find the correct C assignment for the (0, 1) point – 360 distance to C and 362 distance to A, so it will assign it to C. But in that case the brute force algorithm will have created a non-contiguous Voronoi cell, which seems like a bug (indeed this makes me wonder if some of the bug reports I saw on some other implementations when I was googling are related to this problem).

Well, it’s getting late and it’s time for a Quarantini beverage … so I’m stopping here and posting this and hoping to come back to it again with more details once I have done more work on this aliasing problem and its implications.

## python: Decoding multiple JSON strings in a file

I wanted to be able to decode a stream that had multiple, undelineated, JSON string representations in it. For example, a file like this:

[1, 2, 3] [4, 5, 6] [7, 8] [9, 10, 11, 12]

Which has multiple JSON arrays in it, separated only by (optional) whitespace. Perhaps they might be separated by other characters in other applications (e.g., comma separated JSON strings)

First stop, as always, was stackoverflow, which had some relevant answers, like this one:

I decided to write a variant of the several “read it in chunks, try to JSONDecode it, keep reading more if it doesn’t yet parse” solutions. Here is my take at this problem, as a gist:

https://gist.github.com/outofmbufs/bcf3bccc1fe0bb0824871ec5e02cc60e

Posting this in the hopes, as always, that someday someone will be looking for something like this and stumble across it via google (if I’ve somehow worked enough key phrases into this posting)

## Coin flip simulation still running – 62 days!

The coin flip experiment I described before is still running… so far for 62 days. It is running on a dedicated linux box on my network:

```nw@randomlinux\$ uname -a
Linux randomlinux 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1 (2018-04-29) x86_64 GNU/Linux

nw@randomlinux:~\$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 76
Model name:            Intel(R) Atom(TM) x5-Z8350  CPU @ 1.44GHz
Stepping:              4
CPU MHz:               482.695
CPU max MHz:           1920.0000
CPU min MHz:           480.0000
BogoMIPS:              2880.00
Virtualization:        VT-x
L1d cache:             24K
L1i cache:             32K
L2 cache:              1024K
NUMA node0 CPU(s):     0-3

```

and the updated results are:

• Trials Completed: 5,008,009
• Total Coin Flips: 8,604,595,583,786
• Longest Trial: 2,785,974,537,774
• Trials Ending in 2 flips: 2,504,265 (50.005%)

Updated breakdown of the percentages of trials completed in N flips, and the corresponding prediction:

length (flips) actual % theoretical %
2 50.005% 50.00%
4 12.503% 12.50%
6 6.244% 6.25%
8 3.903% 3.906%
10 2.736% 2.734%
12 2.062% 2.051%
14 1.610% 1.611%
16 1.316% 1.309%

I’m letting this run until the next power failure terminates the experiment. Unfortunately the little box this is running on is not set up on a UPS. My power typically glitches (short outages) a few times each year (I’ve written about this before) so I don’t know how long to expect this run to get, but as long as it is still running I will keep collecting the data. Let’s hope for no thunderstorms in the hilltop area any time soon.

## Coin flips and Catalan Numbers

On average, how many times do you have to flip a (fair) coin to get an equal number of heads and tails?

and suggested that the answer was probably infinite (on average), although half the time you get there in just two flips: HT or TH, which are half the possible cases for two flips: HH, HT, TH, TT.

To review: there are 4 possible combinations of two flips: HH, HT, TH, TT. Each is equally likely. Two of them – HT and TH – reach equal heads/tails equality. Let’s call this “reaching delta zero” where delta refers to the difference in heads vs tails count. Since two out of four equally-likely sequences reach delta zero, there is a 50% chance of getting there in just two flips.

The next stage has four flips (three flips will never have delta zero), and there are sixteen possibilities for four flips. However, half of them were eliminated at two flips, leaving only the sequences that start with HH or TT:

```HHHH, HHHT, HHTH, HHTT
TTHH, TTHT, TTTH, TTTT```

Two of these sequences reach delta zero: HHTT and TTHH. So there is a 2 out of 8 ( i.e., 1 out of 4) chance of reaching delta zero at four flips, except we have to account for the 50% chance of never even getting to four flips, so there is a 1 out 8 = 12.5% chance of finishing at four flips.

In the prior post I discussed the catalan numbers and how they relate to the number of these sequences. From that discussion we can compute the probabilities of a given trial run ending (reaching delta zero) at any particular stage. This works out as follows:

```# FLIPS      % reaching equality here
2            50%
4            12.5%
6             6.25%
8             3.91%
10             2.73%
12             2.05%
14             1.61%
```

Unlike the chart shown in the previous posting, these numbers are the chance of the trial ending at the specific level (e.g., a 3.91% chance any given trial run will take exactly 8 flips, no more / no fewer).

I wrote a simulation in python to test this, and I have been running it for over three weeks. Highlights of key results:

• Trials Completed: 3,414,318. This is how many runs were made where each run consisted of python code along these lines (except the actual code used some unrolling and optimization techniques I’ve written aboutÂ  before). This code is essentially simulating flipping a coin and counting the “delta” between the number of heads and tails, so each flip either adds one or subtracts one from that running delta and equality of heads and tails is reached when the delta becomes zero.
```delta = random.choice((-1, 1))
flips = 1
while delta != 0:
delta += random.choice((-1, 1))
flips += 1```
• Total number of coin flips: 3,197,243,199,696. Yes, you are reading that correctly, over 3 trillion simulated flips. I will point out that python, like many (most? all?) runtimes these days uses the Mersenne Twister random number generator with period 2**19937-1; no older style random number generators that have much much shorter periods would be suitable for this sort of experiment.
• Longest run: 1,204,531,561,152. Yes, that’s another “you read that right”. The longest string of coin flips it took to reach equality was a little over 1.2 trillion. That is, after the first flip generated a delta of +1 or -1, in that sequence it wasn’t until 1.2 trillion flips later that the delta was back down to zero.
• Runs ending at 2 flips: 1,707,887 (50.02%). This is as we predicted. experimental results error is 728 out of over 3.4M trials, which comes to a difference of about 213 ppm (or however you want to think of that difference from the perfect 50%).

Here’s a breakdown of the percentages of trials completed in N flips, and the corresponding prediction:

length (flips) actual % theoretical %
2 50.02% 50.00%
4 12.498% 12.50%
6 6.251% 6.25%
8 3.901% 3.906%
10 2.739% 2.734%
12 2.061% 2.051%
14 1.613% 1.611%
16 1.312% 1.309%

I found this write-up nicely describing the math of this and in particular analyzing that the average will indeed diverge to infinity.

The results of seemingly-simple probability questions are often counter-intuitive. “How long will it take me to get an equal number of heads and tails while flipping a coin?” and the answer is “fairly often it could take a very very long time”

## Flipping a coin until reaching equal heads and tails

Today’s question is: how many coin flips, on average, would we have to do if we keep flipping a coin until the number of heads equals the number of tails?

Obviously it takes at least two, and might take many more. After two flips there are four possible outcomes:

2. HT: Heads on flip 1, tails on flip 2.
3. TH: Tails on flip 1, heads on flip 2.
4. TT: Tails on flip 1, tails on flip 2.

From basic probability theory (which you should go review if necessary), each of these four outcomes is equally likely. Two out of the four of them have an equal number of heads and tails; therefore, we know that 50% of the time we try: “Flip a coin until the number of heads and tails are equal” we will be done in two flips. But the other 50% of the time we will have to keep going.

How much longer in that other 50% of the time? The first thing that should be obvious is that there will have to be an even number of flips. There will never be an equal number of heads and tails in 3 flips, or 5 flips, or any odd number of flips.

So we have to do at least 2 more flips 50% of the time we start out (i.e., whenever we start out with HH or TT).

Starting with HH the next string of possibilities is:

1. HH HH
2. HH HT
3. HH TH
4. HH TT – this one reaches equal heads/tails

Similarly, starting with TT:

1. TT HH – this one reaches equal heads/tails.
2. TT HT
3. TT TH
4. TT TT

This time there were eight possibilities, and only 2 of them reach equality. Note that even though there are four flips of the coin there aren’t 16 possibilities, because eight of the sequences start with either HT or TH and therefore reach equality of heads and tails at 2 flips. Or looked at another way, those branches of the 2**n tree (2 possibilities for 1 flip, 4 possibilities for 2 flips,Â  8 for 3 flips, 16 for 4 flips, etc) were pruned off.

To see this explicitly, here is the complete list of all 16 possible sequences of 4 flips:

```HHHH
HHHT
HHTH
HHTT

HTHH ** Never happens; equality after HT start
HTHT **    ...
HTTH **    ...
HTTT **    ...

THHH ** Never happens; equality after TH start
THHT **    ...
THTH **    ...
THTT **    ...

TTHH
TTHT
TTTH
TTTT
```

So, in review:

• 50% of the time 2 flips is enough.
• In the remaining 50% case:
• 25% of the time (2 cases out of 8 as shown above) 4 flips is enough.
• 75% of the time (6 cases out of 8) more than 4 will be needed.

We can start to do some math:

```Percent of time reach equality in 4 flips or fewer:
50% in 2 flips
25% of 50% = 12.5% in 4 flips
75% of 50% = 37.5% takes more than 4 flips
```

We can now say 62.5% (50% + 12.5%) of the time it will take 4 flips or fewer (i.e., 2 or 4) to get to equality, and 37.5% of the time it will take more than 4.

How many more? Carrying this case-by-case analysis out further gets messy quickly. Fortunately this problem is related to something called a Catalan number. A description of them can be found in the Online Encyclopedia of Integer Sequences (A000108). There is also a wikipedia entry which I paraphrase here:

There are many counting problems in combinatorics whose solution is given by the Catalan numbers.

C(n) is the number of Dyck words of length 2n. A Dyck word is a string consisting of n X’s and n Y’s such that no initial segment of the string has more Y’s than X’s.

For example, the following are the Dyck words of length 6:
XXXYYY XYXXYY XYXYXY XXYYXY XXYXYY.

The way they describe that is a bit confusing, plus we are actually interested in both XXXYYY (HHHTTT) and YYYXXX (TTTHHH). Here’s another description from wikipedia:

C(n) is the number of ways to form a “mountain range” with n upstrokes and n downstrokes that all stay above a horizontal line. The mountain range interpretation is that the mountains will never go below the horizon.

and this accompanying picture:

This is better and more directly illustrates why the numbers we need are two times the Catalan number, because we are interested in both mountain ranges that start by going up first (e.g., first flip is heads and eventually we need to get back “down” to equal heads/tails) or by going down first (e.g., first flip is tails and eventually we need to get back “up” to equal heads/tails).

Here is a listing of all possible sequences of 2 flips, 4 flips, 6 flips, 8 flips, and 10 flips that end up with an equal number of heads and tails:

```# 2 sequences of 2 flips: 2 * Catalan(n=0)
HT
TH

# 2 sequences of 4 flips: 2 * Catalan(n=1)
HHTT
TTHH

# 4 sequences of 6 flips: 2 * Catalan(n=2)
HHHTTT
HHTHTT
TTHTHH
TTTHHH

# 10 sequences of 8 flips: 2 * Catalan(n=3)
HHHHTTTT
HHHTHTTT
HHHTTHTT
HHTHHTTT
HHTHTHTT
TTHTHTHH
TTHTTHHH
TTTHHTHH
TTTHTHHH
TTTTHHHH

# 28 sequences of 10 flips: 2 * Catalan(n=4)
HHHHHTTTTT
HHHHTHTTTT
HHHHTTHTTT
HHHHTTTHTT
HHHTHHTTTT
HHHTHTHTTT
HHHTHTTHTT
HHHTTHHTTT
HHHTTHTHTT
HHTHHHTTTT
HHTHHTHTTT
HHTHHTTHTT
HHTHTHHTTT
HHTHTHTHTT
TTHTHTHTHH
TTHTHTTHHH
TTHTTHHTHH
TTHTTHTHHH
TTHTTTHHHH
TTTHHTHTHH
TTTHHTTHHH
TTTHTHHTHH
TTTHTHTHHH
TTTHTTHHHH
TTTTHHHTHH
TTTTHHTHHH
TTTTHTHHHH
TTTTTHHHHH
```

Because we have two variations (heads first, tails first), the number of sequences possible for a given generation (“n” value) is double the catalan number C(n). The formula for the nth Catalan number is:

```# python code. Note: // is integer division
def catalan(n):
return comb(2*n, n) // (n + 1)
```

where comb() is the combinatoric function “m take n”.

Given all this we can compute how many patterns of n flips will result in our experiment finishing with an equal number of heads and tails. First let’s just print out a bunch of Catalan numbers (times 2):

```import math

def comb(m, n):
if m < n:
raise ValueError("requires m >= n")

top = 1
for i in range(n):
top *= m-i

def catalan(n):
return comb(2*n, n) // (n + 1)

for n in range(20):
print("flips={:2d}, 2C(n={})={}".format(
2*(n+1), n, 2*catalan(n)))
```

Running this gives:

```flips= 2, 2C(n=0)=2
flips= 4, 2C(n=1)=2
flips= 6, 2C(n=2)=4
flips= 8, 2C(n=3)=10
flips=10, 2C(n=4)=28
flips=12, 2C(n=5)=84
flips=14, 2C(n=6)=264
flips=16, 2C(n=7)=858
flips=18, 2C(n=8)=2860
flips=20, 2C(n=9)=9724
flips=22, 2C(n=10)=33592
flips=24, 2C(n=11)=117572
flips=26, 2C(n=12)=416024
flips=28, 2C(n=13)=1485800
flips=30, 2C(n=14)=5348880
flips=32, 2C(n=15)=19389690
flips=34, 2C(n=16)=70715340
flips=36, 2C(n=17)=259289580
flips=38, 2C(n=18)=955277400
flips=40, 2C(n=19)=3534526380
```

In the output above, think of n as the generationÂ number where each generation is two flips. Like many things in computers and math we number the generations starting at zero, so the zeroth generation has 2 flips, and there are 2*C(0) == 2 combinations (HT, TH) that reach equality at that generation. Similarly the second generation (n=1) has four flips, and two of the sequences of four flips will reach equality at that generation. We showed this explicitly above by listing all the possibilities. These Catalan numbers become large quickly, confirming that explicit case-by-case analysis is going to get quite difficult. We need to figure out some general math instead.

The next thing we need to figure out is how many of the sequences there will be in each generation. We already know that it is less than 2**flips because some of them are cut off each generation – as was shown when there were only 8 sequences to consider at four flips (instead of 2**4 which would be 16).

If there are q sequences in generation n, the number of them that have equal numbers of heads and tails is given by 2*C(n); those sequences will not continue any further. Each remaining active sequence in generation n becomes four possible sequences in generation n+1, from appending HH, HT, TH, and TT outcomes.

Therefore, if q(n) is the number of active sequences in generation n, the number of sequences in generation n+1 will be:

```q(0) = 4      # by definition
q(n+1) = 4 * (q(n) - 2*C(n))
```

Picking that back apart: q(n) is the number of sequences there will be at generation n, and 2*C(n) is the number of those that will reach equality. Therefore we subtract 2*C(n) from q(n) and that is the number of sequences that need to go for more flips; each of those sequences will expand to four possibilities in the next generation, hence the multiplication by 4.

It’s more convenient to express that equation in terms of the current generation and the previous generation (rather than current and next):

```q(0) = 4
q(n) = 4 * (q(n-1) - 2*C(n-1))
```

As alread noted, some of those sequences will reach equality; for example this formula gives eight sequences at n=1 (four flips). We listed those eight before, and two of them (2*C(1)) reach equality at that generation (so six of them go on to n=2 / six flips).

This is a recursive formula that doesn’t let us compute an arbitrary q(n) without going through all the cases leading up to it, but just for grins lets write some code and print out a list of q(n) values and see if we agree with it:

```for n in range(10):
if n == 0:
q = [4]
else:
q.append(4 * (q[n-1] - (2*catalan(n - 1))))

# q[i] will be the computed value at generation i
print(q)
```

This is some quick-and-dirty code that only works if we request the q values sequentially (as is done in this example), but it’s good enough for now. Running this (with the catalan function as previous defined) results in:

```[4, 8, 24, 80, 280, 1008, 3696, 13728, 51480, 194480]
```

This matches the results we got manually – at n=0 (the very first pair of flips) there are 4 sequences to consider: HH, HT, TH, TT. At n=1, four flips, there are eight sequences to consider and we listed them above. We know, from that list, that 2 of those eight sequences reach equality, leaving 6 “active” sequences for the next generation, and we know that each of those 6 active sequences will turn into four new sequences (from appending HH, HT, TH, and TT) … so there will be 24 sequences to consider at the next generation.

Ok, so what does this tell us about how many times we’re going to have to flip that coin, on average, to get back to equal?

We have all the tools needed to calculate the chance that you reach equality by (or before) generation n. Let E(n) be the probability that we have an equal number of heads and tails by generation n (i.e.,Â  in 2*(n+1) flips). We know, by definition (inspection of all the cases: HH, HT, TH, TT) that E(0) = 0.50 (50%). The probability of reaching equality by arbitrary generation n will be made up of two components:

• the chance you reached equality by or before generation n-1, which is `E(n-1)`
• the chance you had to keep going, which is whatever is left after you take out the E(n-1) chance you already finished; so that is `1 - E(n-1)`, which we will then multiply by the chance that you reach equality on this nth generation.

We know from the above formulas that q(n) is the number of sequences “in consideration” in generation n. We also know that the number of sequences that will reach equality in generation n is given by: 2C(n) (twice the catalan number for n).

Therefore we know the chance you reach equality in a specific nth generation is the number of such sequences that reach equality, divided by the total number of sequences that would be in consideration at that generation:

2C(n)/q(n)

You can verify that yourself with the four flip case (n=1), where there were 8 sequences (q(1) == 8) and 2 of them (C(1) == 1, 2 times that is 2) reached equality, leading to a 25% chance of being done in that case (with that case only happening 50% of the time). Putting all that together we finally get:

E(n) = E(n-1) + (1 – E(n-1)) * 2C(n)/q(n)

Again we have a recursive formula that requires us to iterate through the cases. Here’s some code:

```def q(n):
if n == 0:
return 4
else:
return (4 * (q(n-1) - (2*catalan(n-1))))

def bigE(n):
if n == 0:
return 0.50
else:
em1 = bigE(n-1)
return em1 + ((1 - em1)*((2*catalan(n)) / q(n)))
```

Python experts will immediately realize there are several efficiency and memory problems lurking here;Â  on my machine trying bigE(500) takes about 5 seconds, and for large enough value bigE will fail with an exception because the recursion gets too deep. We can fix those problems but for now this code suffices to just get a feel for some results. I ran the calculations for some values and printed out the chance of NOT reaching equality (1-E(n)) by the nth generation:

```Chance equality NOT reached after    2 flips: 50.00%
Chance equality NOT reached after    4 flips: 37.50%
Chance equality NOT reached after    6 flips: 31.25%
Chance equality NOT reached after    8 flips: 27.34%
Chance equality NOT reached after   10 flips: 24.61%
Chance equality NOT reached after   20 flips: 17.62%
Chance equality NOT reached after   50 flips: 11.23%
Chance equality NOT reached after  100 flips:  7.96%
Chance equality NOT reached after  200 flips:  5.63%
Chance equality NOT reached after  400 flips:  3.99%
Chance equality NOT reached after  600 flips:  3.26%
Chance equality NOT reached after  800 flips:  2.82%
Chance equality NOT reached after 1000 flips:  2.52%
```

These values seem pretty counter-intuitive to me. A 2.52% chance of not reaching equality within 1000 flips is huge. It means that if you sit down periodically and try this yourself with a coin (assuming it is a fair coin), about 1 time out of 40 you will have to do 1000 flips and still would have not reached equality. I think most people when asked how frequently that would happen (especially people who misunderstand the “law of averages”) would guess it to be very rare, and certainly not as likely as a 1 in 40 shot.

I opened this posting with the question “what would be the average number of flips it takes to reach equality” but I’m still not able to answer that question. I can’t tell yet whether these formulas, with their tailing-off values as they get large, converge onto a bounded number or whether in fact the average number of flips to reach equality might be unbounded (i.e., infinite). I’m going to wrap up this post just to get it “out there” and do some more google research later. I do wonder if I may have reached the limit of my mathematical pay grade in this area.

I have written a python simulator to do some “flip until equality” experiments, but I’m not yet prepared to write it up as I want to ensure that its results are valid, and without better math support I’m not sure yet. I will say the preliminary results from the python simulator suggest that the answer to “what is the average” might be “infinity” (!!) though I am not yet certain of that. I have definitely had some astoundingly-long runs come out of the simulator so far though — the longest I’ve seen at the moment being over 380 billion flips before equality was reached (I *think* the python random number generator is suitable for this task, having a period of 2**19937, but that’s just one of many questions I have at the moment about my simulation code).

With the caveat that I’m not 100% certain my code’s results are correct, here are some preliminary results of interest from the simulation.

I’ve run 1,607,009 trials (sequences of coin flips, each starting with one flip and finishing when the number of heads and tails is equal).Â  Out of those trials, 804,083 reached equality in two flips. That’s 50.036% which tracks the computed/expected result correctly. Equality in four flips was reached in 200,809 trials, or 12.496% which also tracks the 12.5% prediction. The longest sequence seen was 380,172,077,974 flips — 380 billion flips!

The top ten longest flip sequences at the moment are:

```380,172,077,974
33,378,243,830
27,175,971,118
25,567,423,414
22,337,890,602
11,709,891,508
11,678,616,684
10,918,847,300
10,600,505,508
9,424,437,132
```

These results are fueling my suspicion that the overall average might be unbounded… i.e., that these huge sequence length numbers will occur frequently enough to drive the average towards infinity. But I’m not certain of that (can’t really be certain of it until we do more math).

Sequences of length 2 occurred 804,093 times, length 4 occurred 200,809 times, etc so if you imagined a very long list of numbers – 804,093 entries with “2”, 200,809 entries with “4”, etc, and added them all up and computed the average entry value you’d get 398,917 (obviously I didn’t do this manually, I did it by weighted-average of my recorded data points). This number might be meaningless though, if my speculation about the average being unbounded is correct (i.e., if the average will continue to increase as more trials are run). But at the moment if you ask me to answer “Dammit, Neil, what’s the average?” I guess I’d say 400K flips needed, on average, to reach equality. Again that’s a huge number and somewhat counter-intuitive; you’d think a fair coin would reach my heads/tails equality criteria in “some reasonable” amount of time. Most human intuitions about probability are wrong, and assuming my simulation is returning valid results, this appears to be yet another case of that.

## Building an HTTP/POST request/response protocol

In the previous post:Â Python Simple Client/Server Socket Communication ModuleÂ I began exploring using python’s http.server module to build a simple HTTP server as framework for a request/response protocol that could invoke a remote function and return some results to the client.

My goal is to be able to write code like this:

```# ON A SERVER MACHINE
def foo(input_dictionary):
results = ... do something with input_dictionary
return results

server = MakeServer(port=12345, func=foo)

# ON A CLIENT MACHINE
client = MakeClient(port=12345)

input_dictionary = { something ... }
results = client.command(input_dictionary)```

and essentially have a trivial remote procedure call mechanism allowing the client code to invoke the foo() function on the given input.

We’re going to build this using python’s http.server module. We are going to POST to “/” (indeed, we are going to ignore the path parameter), and the data of the post itself will be a JSON encoded python object. Rather than having to parse a “Data-length” line or anything like that ourselves, the HTTP protocol will handle that part for us; all we have to do is pluck the data length value out of the HTTP header and then read and decode the posted body ourselves.

Expanding on the do_GET example from my previous post, here’s a do_POST server function:

```from http.server import HTTPServer
from http.server import BaseHTTPRequestHandler
from http.server import HTTPStatus
import json

class MyRequestHandler(BaseHTTPRequestHandler):
def do_POST(self):
print("Got object: {}".format(obj))
self.send_response(HTTPStatus.OK)

server = HTTPServer(('', 12345), MyRequestHandler)
server.serve_forever()
```

This simple code needs more error checking. There is also an implicit conversion between 8-bit over-the-wire bytes and python internal string representation (full unicode) hiding inside the `json.loads` function. In fact, if you are running a version of python older than 3.6 you might be getting an error message related to this and the distinction between a python `bytes` object (returned by `read`) and a python string (expected by `json.loads`). So let’s fix that before going further.

In the good old days, all computing was done in English and the 8-bit ASCII character set was good enough for everyone. All characters fit into one byte, all bytes and strings could be considered more or less as interchangeable things. Obviously, those days are long gone. Even if you say you don’t care about japanese/chinese/etc speakers and their ability to send their characters (which would be a mistake of course), even English users demand full unicode support if for no other reason than to be able to put smiley faces and other emojis into their data. UnicodeÂ 😀❤️🐳💡🎉Â Happens!

Starting in python 3.x python uses Unicode as the native format for strings. The good part about this is all your code automatically will work with all of those character types. The bad part about this is you have to be cognizant of how Unicode (more than 8 bits per character) interfaces with parts of the world that operate on 8-bit bytes – like TCP streams for example. Arguably this “bad” thing is actually a “good” thing as it forces you to make your code work for everyone, even people who don’t use only ASCII.

So to see Unicode in action in python, try this:

```s = '\U0001F600'
#     ^ **NOTE** that's an uppercase U
print("s = /{}/ and len(s) = {}".format(s, len(s)))```

This will show you:

It’s beyond the scope of this post to explain why we usually consider “native Unicode” to be an internal representation of characters and use a different external representation when sending Unicode strings “over the wire” in a protocol. I am just pointing out that this is something we have to do – pick an encoding method and use it properly on both ends.

The most common, standard, encoding used for applications like this is called utf-8, which has the advantage that the first 128 ASCII characters (all the “good old days” characters) are still encoded the same way, as a single byte, which tends to increase interoperability with naive/old programs that are not Unicode enabled (in fact this is one of the “beyond the scope of this post” reasons to encode characters rather than send everything in its 32-bit raw Unicode glory).

So we are going to convert our internal strings into a python `bytes` object on one side, and back on the other. A `bytes` object is an iterable sequence of 8-bit integers (0 .. 255), and has a `decode` method for converting that sequence of bytes into a Unicode string. For example:

```>>> letterA = bytes([65]).decode('utf-8')
>>> print(letterA)
A
```

What this is showing you is that a single byte, with value 65, when decoded using the ‘utf-8’ encoding, becomes a string of one character, an uppercase A. This is demonstrating a property of utf-8, namely that the original (“good old days”) 8-bit ASCII character values are encoded as themselves. Other characters are encoded using multiple bytes, so for example:

```>>> jpnhouse = bytes([229, 174, 182]).decode('utf-8')
>>> print(jpnhouse)
å®¶
```

this is demonstrating that the three-byte sequence [229, 174, 182], when decoded using ‘utf-8’, will become a single character that is (I think) the word “house” in Japanese.

We don’t really need to understand encodings other than to know the encode/decode steps are there, have to be performed, and have to use the same encoding on both sides of the wire. Starting in python version 3.6, the json.loads function will accept a bytes object and do an implicit utf-8 decoding for you. This is why the first example code given up above “works” if you are running python 3.6 or later, but will fail with a complaint about strings versus bytes on earlier versions.Â  I think it is better practice for us to make that step explicit, which will also have the benefit of making that example code work on older versions of python (python 3.x).

With the explicit decode step the code becomes:

```class MyRequestHandler(BaseHTTPRequestHandler):
def do_POST(self):
data_str = data_bytes.decode('utf-8')
print("Got object: {}".format(obj))
self.send_response(HTTPStatus.OK)
```

This still isn’t sending any response (other than the “OK” HTTP code) so let’s add that. This revised handler takes out the print on the server side and wraps the received object inside another dictionary `{'echo': obj}` and then sends that back to the client as the response data:

```class MyRequestHandler(BaseHTTPRequestHandler):
def do_POST(self):
rslt_str = json.dumps({'echo': obj})
rslt_bytes = rslt_str.encode('utf-8')
self.send_response(HTTPStatus.OK)
self.wfile.write(rslt_bytes)
```

This works – but before putting it into production it should probably be enhanced in several ways. It leaves out several HTTP headers in the response; we should probably fill in Content-Type (‘application/json’ would be appropriate) and Content-Length. It turns out Content-Length isn’t “needed” because the default response format is HTTP/1.0 which defines the length by closing the stream at the end. But we might want to use HTTP/1.1 in the response format and include a Content-Length (and thus also allow for persistent connections which were not supported under the HTTP/1.0 format). All of these elaborations are left as an exercise for the reader at this point, in consultation with the http.server module documentation. I made many of these improvements in the code that I will also be posting on github (TBD at the time of writing this posting).

With this primitive server we have enough framework to use curl to fire commands at a server and have it invoke some function (in this case hardwired into “encapsulate the object and return it”) and return results to the client. This works really well with not very many lines of code!

Let’s build an explicit client instead of using curl (although being able to use curl does demonstrate one of the advantages of picking a standard transport protocol such as HTTP/POST). Here is a bare bones test request function:

```from http.client import HTTPConnection
import json

def testrq(obj):
c = HTTPConnection('localhost', 12345)
c.connect()
encoded = json.dumps(obj).encode('utf-8')
response_string = response_bytes.decode('utf-8')
```

This connects (to hardwired localhost:12345) and sends whatever object (obj) you provide, gets the response (see the http.client / HTTPConnection documentation) and decodes it as a JSON object. Obviously all semblance of generalization and error checking has been omitted here. But this works.

Now that we know how to send arbitrary python objects back and forth from client and server we can work on building a real, but still simple, generalized framework for all this. That will be the next post.

## Python Simple Client/Server Socket Communication Module

I wanted a python module with a simple client/server request/response protocol … something that would let me invoke a function remotely, with code looking something like:

```def server_function(input_dictionary):
... do something with the input_dictionary ...
return {'result': 'blah blah blah'}

def client():
server = ... connect to the server ...
request = { some dictionary of stuff here }
result = server.command(request)
... result is a dictionary as returned by the server```

The client would formulate a request as a dictionary, send it to the server, the server processes it, formulates a dictionary as a reply, and returns that to the client. In effect this is a simplified form of a general remote procedure call system.

Google to the rescue? Yes, there are any number of github repositories and official modules out there that sort of do this. However, I had three problems with all the ones I found:

• They had bugs in them regarding TCP stream semantics, or, perhaps, if not outright bugs they were written with limitations that wouldn’t generalize to transferring large amounts of data in a single request/response pair.
• They worked at a byte stream level of abstraction, and I wanted a higher-level “message” or “packet” abstraction (more like the above “send a dictionary, get a dictionary” model).
• OR … they were huge frameworks, that were very powerful but felt like overkill for my application. I didn’t needÂ to set up an entire REST API, I didn’t need the features of a “real” application server, etc. Of course this is dangerous thinking, in that anything that starts out as a trivial 100-line hack might someday grow into something real. Nevertheless, I decided to proceed with a small implementation of something for my specific purpose. Though, as we’ll see, I did conclude that HTTP/POST was a suitable transport mechanism and ended up implemented what can only be thought of as a completely degenerate (one URL) imitation of a REST API. It’s up to you whether to think of what I’ve done here as something useful, perhaps for limited applications or at least as a learning environment, or a complete waste of time.

On the bugs front, let’s take a look at this code from 21.21.4.1 in the python version 3.6 library documentation for the socketserver module. Here’s the relevant part of their server example:

```# self.request is the TCP socket connected to the client
data = self.request.recv(1024).strip()
# just send back the same data, but upper-cased
self.request.sendall(data.upper())
```

Leveraging the surrounding socketserver framework, this code receives a string over a socket, converts it to upper case, and sends that back as the response. It’s not hard to imagine generalizing this to where perhaps it is passing JSON-serialized python dictionaries (or other arbitrary serializable objects) – voila! Just what I was looking for.

But there’s a problem: TCP is a stream oriented protocol and not a “message oriented” protocol. There are potential bugs lurking in some subtle assumptions in the above code.

What happens if you want to send a 4000 byte string? Let’s try it. I modified the above code to say “4096” in the recv call and similarly modified the client. In fact, the python socket module documentation recommends 4096 as a reasonable value to specify in calls to recv():

```Note For best match with hardware and network
realities, the value ofÂ bufsizeÂ should be a relatively
small power of 2, for example, 4096.
```

So with that in mind here are the relevant excerpts of the two sides modified to send 4000 bytes (and using a 4096 byte recv() buffer):

```# Server, modified to read up to 4096 bytes
def handle(self):
data = self.request.recv(4096).strip()
self.request.sendall(data.upper())
```
```# client, modified to send/recv up to 4096 bytes
# and report the amount of data sent/received
with socket.socket(socket.AF_INET,
socket.SOCK_STREAM) as sock:
data = "0123456789" * 400        # 4000 bytes total
sock.connect((HOST, PORT))
sock.sendall(bytes(data, "utf-8"))

print(
"Bytes sent {}, bytes received {}".format(

```

I ran the client in a loop and got output like this:

```Bytes sent 4000, bytes received 4000
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 4000
Bytes sent 4000, bytes received 1448
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 2896
Bytes sent 4000, bytes received 2896
```

In fact it’s worse than this. If you put a print(len(data)) statement in the server to see what it thinks it is getting, you will find that sometimes the mismatch is on the server side (i.e., the server thinks the client sent less than 4000 bytes), and sometimes it is on the client side (i.e., the server got all the bytes, but the client thinks it got fewer bytes in the reply), and sometimes it is on both.

What’s going on???

TCP is a byte-stream protocol. It reliably delivers the bytes you give to it, and it delivers them in order. What it does NOT do is preserve any notion of message boundaries within that byte stream. We might make one call to sendall() with 4000 bytes, but that does not guarantee that a single recv() on the other side will get all 4000 bytes at once.

When TCP packets are sent “over the wire” (between two distinct machines, via some form of underlying network technology) they will be broken up into segments that have a maximum size, called, unsurprisingly, the maximum segment size, and abbreviated as MSS. The specific MSS value varies network by network, sometimes because of network technology differences, sometimes because of other factors.

If you run the above client and server on a single machine then there is no actual network between the two endpoints. The communication happens over TCP, but entirely within the confines of one machine. In that case it is likely that the code will work perfectly; 4000 bytes will be sent/received consistently on both sides (however this depends on operating system implementation details).

But if you split the client and server across two distinct machines, bringing real network technology (and limitations) into play, you will see results like those I got above. You might even see different results depending on whether your client and server are separated by a wired network, a WiFi network, a cellular network, and so forth, and whether it is a local network (from machine A to machine B in your own house, for example) or a true WAN (from machine A in your house to perhaps machine B running as an AWS cloud instance).

The fact that this simple-minded code “works” when you try it just on your own machine and doesn’t show any bugs until deployed on a larger network is a common source of misunderstanding – leading to hard-to-find bugs.

If the network can only send 1448 bytes in one “segment”, then sending 4000 bytes becomes (at least) a total of three TCP segments over the wire. They can be sent very quickly, so in many cases all three will arrive so fast that the other side will see all the data at once and will receive the full 4000 bytes. But this won’t always happen; sometimes the first segment will arrive and perhaps the second segment is delayed just long enough for the other side to not see it before the recv() call thinks it is finished. And that’s how we end up seeing only 1448 bytes transmitted sometimes, because the recv() call completes its work before the second chunk of data arrives at the destination machine.

Aside: sendall is a python-level convenience function and not an operating system (socket API) primitive. We don’t know whether sendall makes a single call to the underlying socket API to send data or breaks your data down into chunks and loops over multiple OS calls. However, it doesn’t matter. This segmentation of data sent over TCP will happen no matter what any time the MSS is smaller than the total amount of data you are trying to send.Â Â

Emphasizing again: TCP is a byte-stream protocol, not a message protocol. If we want to send and receive structured messages, we have to layer our own application-layer protocol on top of the byte stream provided by TCP. So instead of just sending a JSON-encoded dictionary from client to server, which could easily be 4000 bytes (or more!) and trigger the problems we’re seeing above, we have to put some “decorations” on our data so that each side can parse it and structure it accordingly.

It’s common (and has been for decades) to define protocols to do this using a line-oriented format of headers and data. So, for example, instead of just sending JSON data we can send something that looks like this:

```Data-length: 37
["this is a JSON list of one string"]```

In this case we have a simple header “Data-length: 37” followed by (if wordpress hasn’t mangled it too badly) exactly 37 characters which are a JSON presentation of a list containing one string.

Now instead of just calling recv() and trying to read the entire message, we would read data line-by-line, parse the fields as they come in, and we can loop over multiple recv() calls if necessary because the “Data-length” header tells us how many bytes to expect after that.

If you are paying attention, you’ll realize there are still problems lurking here. Even with this “send a line telling us how long the data will be” protocol, it is still conceptually possible that even that line, no matter how short it is, might be broken up into multiple TCP segments. In other words, just because this is short:

`Data-length: 37`

doesn’t mean we will always get it in one read() operation when we are using a TCP stream. Fortunately python (as do many other languages) provides io libraries that essentially read an input stream character by character. That low level code has routines such as “readline” that go character-by-character (waiting for additional input from the TCP stream as necessary) building up a string into a “line” until a newline character is reached. This means we can write our code in terms of “read one line” and not worry about TCP segmentation; the newline character in effect becomes an in-band message boundary character that we can parse out and reconstruct the line abstraction from the raw TCP byte stream.

So we could go off and (carefully) write a bunch of code to implement this sort of protocol on top of TCP, and in fact I’ve done that as an exercise, but pretty soon it becomes apparent that we are in essence re-inventing something that already exists… HTTP requests, and especially HTTP POST requests of (perhaps) JSON encoded application data.

At which point “just go install one of those big REST frameworks” is a plausible answer to my question. Nevertheless, I decided to try to implement a smaller example of HTTP/POST transport for this if for no other reason than as an interesting learning exercise.

In fact python has some modules that make implementing a special purpose HTTP server pretty simple. The http.server module includes an HTTPServer class that allows you to pretty quickly set up a trivial server. Here is a server that just prints out what path name it received and returns an HTTP OK response (with no other data) the client:

```from http.server import HTTPServer
from http.server import BaseHTTPRequestHandler
from http.server import HTTPStatus

class MyRequestHandler(BaseHTTPRequestHandler):
def do_GET(self):
print("GET request on {}".format(self.path))
self.send_response(HTTPStatus.OK)

server = HTTPServer(('', 12345), MyRequestHandler)
server.serve_forever()
```

Running this establishes an HTTP server on port 12345 and you can play with it by sending it GET requests using something like curl:

```In window A:

% python3 theabovecode.py

In window B:

% curl localhost:12345/testpath```

and you’ll see output indicating that the server received a request for path “/testpath”

In my next post I’ll write up how to use this framework to process an HTTP POST request of JSON-encoded data to/from a trivial server.

## Random Python Performance Posting

I’ve been writing some very elementary probability simulations in python and wanted to simulate millions of coin flips. I started out with this somewhat-obvious code:

```    ht = [0, 0]
for i in range(n):
ht[random.choice((0, 1))] += 1
```

which counts heads/tails in ht[0] and ht[1] and performs n flips.

Can this code be sped up? Inserting time.time() calls before/after the “for” loop I found that this code can perform 10 million flips in 11.8 seconds. So it’s about 1.18 microseconds per flip.

[ aside: all measurements were made on my MacBook Air. I tried to avoid doing anything else while running the timing tests, and in all cases took the fastest result I could get. All results for a given method were similar and were run multiple times at multiple different points in time, so I believe there were no confounding factors with background activities. Your mileage may vary. ]

Borrowing a trick from the old geezer days, I thought “surely if we unroll the loop this will go faster”.Â  So I changed it to “for i in range(n//100)” and then put 100 of the “ht[random.choice(…” statements in the body of the loop.

Result? 1.16 microseconds per flip. Ok, that’s faster, but by less than 1%. Kudos to python for having efficient looping, plus, of course, the random() call dominates the cost anyway.

Ok maybe random.choice is slower than explicitly asking for a random number:

```    ht = [0, 0]
for i in range(n):
ht[random.randrange(2)] += 1
```

Nope, that came in significantly slower! 1.33 microseconds per flip.

Maybe I could speed it up by eliminating array access and just adding up +1 and -1 values in a loop and reconstructing the heads vs. tails numbers after:

```    delta = 0
for i in range(n):
delta += random.choice((-1, 1))
```

That ran as fast as 1.14 microseconds per flip, but still just an insignificant improvement over the the original and arguably more “obvious” code.

Of course, by this point readers are screaming at me “duh, all the time is in the random function and there’s nothing you can do about it” and they are right. I should have tested that first. For example, this code:

```    ht = [0, 0]
for i in range(n):
ht[1] += 1
```

runs at 0.11 microseconds per “flip” (albeit there is no flip, it just always adds to ht[1] in this example). Or if we go all the way to the delta method, it will be as little as 0.06 microseconds per “flip” for this:

```    delta = 0
for i in range(n):
delta += 1
```

obviously there is no “flip” here but it’s fast.

Of course, we can use a single call to random to generate more than one random bit at a time. In C we might generate a 32-bit random number and then peel off the bits one by one. I’m guessing that doing this in python won’t be as efficient, but there’s still a way to easily amortize one call to the random method over multiple flips. For example we can enumerate all sixteen possibilities for four flips. In binary, they would be: 0000, 0001, 0010, 0011, … 1100 1101 1110 1111. We can note that the first (“0000”) is 4 heads (arbitrarily calling the 0 heads, it doesn’t matter which) and 0 tails. The next one (“0001”) is 3 heads and 1 tails, and so on. The complete enumeration of the outcomes looks like this:

```    fourflips = ((4, 0),    # 0000
(3, 1),    # 0001
(3, 1),    # 0010
(2, 2),    # 0011
(3, 1),    # 0100
(2, 2),    # 0101
(2, 2),    # 0110
(1, 3),    # 0111
(3, 1),    # 1000
(2, 2),    # 1001
(2, 2),    # 1010
(1, 3),    # 1011
(2, 2),    # 1100
(1, 3),    # 1101
(1, 3),    # 1110
(0, 4))    # 1111
```

and now our loop could look like this:

```    ht = [0, 0]
for i in range(n//4):
flips = random.choice(fourflips)
ht[0] += flips[0]
ht[1] += flips[1]
```

Running this code we get 0.330 microseconds per flip. About three and a half times as fast as the original code – which makes sense, because we are calling the random function one fourth as many times and that call dominates the timing.

Of course, in doing this optimization I’ve lost the data on individual flips and now I only know the relative number of heads and tails for the four flips that are grouped together. In other words, I can no longer distinguish between a group of four flips that were “HTHT” vs “HHTT” or many other combinations that will all appear as just (2, 2) when returned by random.choice. In my application this is unimportant. If it were important, I could of course have stored the individual flips in the fourflips variable, constructing it this way:

```    fourflips = (
(0, 0, 0, 0),
(0, 0, 0, 1),
(0, 0, 1, 0),
... etc
```

and processing it accordingly. I stuck with the aggregated counts because that’s all I cared about in my simulations.

We can take this one level further and combine the running delta method (+1 for a heads, -1 for a tails, and reconstruct the final number of heads and tails from the total number of flips and the delta), plus we can compute the delta ahead of time for each element of the fourflips list (e.g., (0, 4) is a delta of -4, (2, 2) is a delta of 0, etc). Combining those concepts we get:

```    ht = [0, 0]
fourdeltas = (4, 2, 2, 0, 2, 0, 0, -2,
2, 0, 0, -2, 0, -2, -2, -4)
delta = 0
for i in range(n//4):
delta += random.choice(fourdeltas)
```

which gets us all the way down to 0.285 microseconds per flip.

How far can this optimization go? As far as you have patience for and as large as you want the list of deltas to be. I tested the same idea at five flips precomputed as deltas (i.e., five bits at a time) and got down to 0.232 microseconds per flip. There’s a diminishing return for each additional bit added – at five deltas we’re calling the random function 20% as often as the original code, but at four deltas we were already down to 25% so the improvement isn’t as dramatic. Also the deltas list doubles in size each time though of course that can be precomputed by a program so it’s not clear that there’s a pragmatic limit. I don’t know if at some point the length of the deltas list makes random.choice execute slower; I don’t think it would but it depends on implementation details.

For my purposes I was happy at five bits per call to random.choice and 0.232 microseconds per flip, which is more or less five times faster than the original.