msgpack for topojson

Mike Bostock has published TopoJSON, a clever variant of GeoJSON that encodes topologies more efficiently. For a map of US counties, for instance, the GeoJSON approach is a bunch of LineStrings or Polygons that define each county boundary. Adjacent regions share borders, so there are a lot of redundant line segments; TopoJSON eliminates those by encoding the line segments separately and then referencing them to define each boundary. TopoJSON also uses relative integer coordinates instead of absolute floating-point geocoordinates, which greatly reduces size.
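
As a rough illustration of the relative integer idea, here is a minimal sketch of delta encoding: the first point of a line stays absolute and each later point is stored as its offset from the previous one. The function names and the sample coordinates are made up for illustration, and the sketch skips the quantization step that turns floating-point coordinates into integers in the first place.

```python
def delta_encode(points):
    """Turn absolute integer points into per-step offsets.

    The first output pair equals the first point, because the running
    position starts at (0, 0).
    """
    encoded = []
    prev_x, prev_y = 0, 0
    for x, y in points:
        encoded.append([x - prev_x, y - prev_y])
        prev_x, prev_y = x, y
    return encoded


def delta_decode(deltas):
    """Rebuild absolute points by summing the offsets in order."""
    points = []
    x, y = 0, 0
    for dx, dy in deltas:
        x += dx
        y += dy
        points.append([x, y])
    return points


# Hypothetical line on an integer grid; the values are invented for this example.
line = [[4000, 3000], [3999, 3000], [4000, 2980], [3986, 2990], [3959, 2981]]
deltas = delta_encode(line)
print(deltas)  # [[4000, 3000], [-1, 0], [1, -20], [-14, 10], [-27, -9]]
```

Neighbouring points on a boundary tend to be close together, so most offsets are small numbers with few digits.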

But is it as small as possible? There are a lot of pairs of small integers like [-1,0],[1,-20],[-14,10],[-27,-9]; that is roughly 8 bytes of JSON per pair for what could fit in 2. I took a quick crack at using msgpack to encode the JSON for us-counties more efficiently. This is what I learned (sizes in bytes):

 Source        668312
 Gzip          175895
 Msgpack       344208
 Msgpack gzip  166761

Msgpack is about half the size of the JSON, but after gzip that difference mostly goes away: the gzipped msgpack is only about 5% smaller than the gzipped JSON. That’s what I found last time I looked at msgpack, too.
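
To see where the uncompressed saving comes from, here is a small sketch that counts bytes for the four sample pairs in three encodings: compact JSON, a msgpack-style encoding, and raw signed bytes. It does not call the msgpack library; `pack_small` is a hypothetical hand-rolled helper that follows the msgpack format's one-byte rules for small integers (-32 to 127) and short arrays (up to 15 items) and handles nothing else.

```python
import json
import struct

pairs = [[-1, 0], [1, -20], [-14, 10], [-27, -9]]


def pack_small(value):
    """Encode small ints and short lists the way the msgpack format does.

    Only covers the one-byte cases: fixint (-32..127) and fixarray
    (0..15 items). Anything else raises ValueError.
    """
    if isinstance(value, list):
        if len(value) > 15:
            raise ValueError("sketch only handles arrays of up to 15 items")
        header = bytes([0x90 | len(value)])  # fixarray: 1001xxxx
        return header + b"".join(pack_small(v) for v in value)
    if -32 <= value <= 127:
        # Positive fixint is 0xxxxxxx, negative fixint is 111xxxxx;
        # both are the value's low byte in two's complement.
        return bytes([value & 0xFF])
    raise ValueError("sketch only handles integers from -32 to 127")


as_json = json.dumps(pairs, separators=(",", ":")).encode("ascii")
as_msgpack = pack_small(pairs)
# Lower bound: one signed byte per number, no structure at all.
as_int8 = struct.pack("8b", *[n for pair in pairs for n in pair])

print(len(as_json), len(as_msgpack), len(as_int8))  # 34 13 8
```

Each pair costs 3 bytes in the msgpack-style encoding (one array header plus two one-byte integers) against 8 or so in JSON, which fits with msgpack coming out well under the JSON size before compression.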

Some custom encoding could make the TopoJSON format smaller still, but I’m not convinced that after gzip it will make much difference. Probably not worth the complexity. The one thing I’d consider is recoding the list of pairs of integers as a flat list of integers, so instead of [[-1,0],[1,-20],[-14,10],[-27,-9]] you have [-1,0,1,-20,-14,10,-27,-9]. That removes 2 bytes per pair (the inner brackets), although after gzip that may be negligible. (Update: Mike tried it: 552K (vs. 656K) uncompressed, 164K (vs. 172K) compressed.)
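
A minimal sketch of that flattening, with the inverse needed to read the data back. The names `flatten` and `unflatten` are made up here, and this is not necessarily how Mike implemented his test.

```python
import json


def flatten(pairs):
    """[[x0, y0], [x1, y1], ...] -> [x0, y0, x1, y1, ...]"""
    return [n for pair in pairs for n in pair]


def unflatten(flat):
    """Inverse of flatten; expects an even number of values."""
    if len(flat) % 2:
        raise ValueError("flat list must hold an even number of values")
    return [list(flat[i:i + 2]) for i in range(0, len(flat), 2)]


pairs = [[-1, 0], [1, -20], [-14, 10], [-27, -9]]
flat = flatten(pairs)

nested_size = len(json.dumps(pairs, separators=(",", ":")))
flat_size = len(json.dumps(flat, separators=(",", ":")))
print(nested_size, flat_size)  # 34 26
```

The 8 bytes saved are exactly the inner brackets of the four pairs.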


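#!/usr/bin/env python3
# Compare the size of a JSON file with its msgpack encoding,
# before and after compression. Requires the third-party msgpack package.

import json
import sys
import zlib

import msgpack

# Read raw bytes so the sizes are byte counts and zlib can compress them directly.
with open(sys.argv[1], "rb") as f:
    text = f.read()
data = json.loads(text)
msgpackText = msgpack.packb(data)

def log(s, d):
    print("%20s %s" % (s, d))

# zlib.compress emits a zlib stream rather than a gzip file; the DEFLATE
# payload is the same, so the sizes differ only by a few header bytes.
log("Source", len(text))
log("Gzip", len(zlib.compress(text)))
log("Msgpack", len(msgpackText))
log("Msgpack gzip", len(zlib.compress(msgpackText)))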