Checking out msgpack

msgpack sounds like it could be useful: a binary alternative to JSON that promises to be “like JSON but it’s faster and smaller”. Is it really? Not dramatically. On a sample of 175 JSON files I had lying around, the msgpack files were on average about 76% the size of the JSON files. Better, but not a big win. And if you’re a bit careful with JSON formatting and float precision, it’s not clear msgpack has anything to offer on size.

The big winner (ukiah.json, where the msgpack is 12% of the original) is a GeoJSON file with a few thousand lat/lon coordinates in it, all pretty-printed with spaces. Some files actually get bigger with msgpack; a quick look suggests those are files with short floats in them like “32.17”, and it may be that the Python encoder is giving each of those a full 8 bytes for an IEEE double (plus a type byte).
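
A rough way to check the short-float guess without the msgpack library: the format spec defines "float 64" as the type byte 0xcb followed by an 8-byte big-endian IEEE 754 double. The helper below, msgpack_float64, is a hand-rolled stand-in I made up for illustration, not the library's own code path, and it assumes the encoder always picks float 64 for Python floats.

```python
import json
import struct

def msgpack_float64(x):
    # Per the msgpack spec: type byte 0xcb, then an 8-byte big-endian double.
    # Hypothetical helper for illustration; the real encoder is msgpack.packb().
    return b"\xcb" + struct.pack(">d", x)

value = 32.17
json_bytes = json.dumps(value).encode("ascii")  # the text "32.17"
packed = msgpack_float64(value)

print(len(json_bytes), len(packed))  # prints: 5 9
```

So under that assumption every short float costs 9 bytes in msgpack no matter how few digits it had in the JSON, which would be enough to explain the files that grow.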

For a file I actually care about, my WindHistory stations-md.json database, the 1 meg JSON file goes to 500k. That’s a big win for me in a deployed webapp. I think msgpack may do a particularly good job with the arrays of small numbers that file contains, e.g. [0,5,6,4,4,15,18,14,9,5,7,7,6]. I’d considered doing a custom format for that, but dropping in msgpack would sure be easier.
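
Here is a small sketch of why arrays like that one shrink so well. In the msgpack spec an array of up to 15 items gets a one-byte "fixarray" header (0x90 plus the length) and every integer from 0 to 127 is a single "positive fixint" byte. The function pack_small_int_array is a hand-rolled illustration that only covers that narrow case; it is not how the library is implemented.

```python
import json

def pack_small_int_array(values):
    """Hand-rolled msgpack bytes for a list of at most 15 ints in 0..127."""
    if len(values) > 15 or any(not (0 <= v <= 127) for v in values):
        raise ValueError("outside the fixarray / positive fixint range")
    # fixarray header is 0x90 | length; each positive fixint is the value itself
    return bytes([0x90 | len(values)] + list(values))

winds = [0, 5, 6, 4, 4, 15, 18, 14, 9, 5, 7, 7, 6]
packed = pack_small_int_array(winds)
compact = json.dumps(winds, separators=(",", ":")).encode("ascii")

print(len(packed), len(compact))  # prints: 14 30
```

That is 14 bytes against 30 for the tightest possible JSON, a bit under half, which is in the same ballpark as the ratio measured for stations-md.json.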

Update: I did a quick extra comparison, passing the data through zlib.compress() after encoding and before comparing lengths. Compressed msgpack is worse than compressed JSON! For my sample, the compressed msgpack data was on average 106% the size of the compressed JSON data. For the WindHistory database the msgpack was still smaller, but only a little: 90%. zlib (the same deflate compression gzip uses) must be good at removing the text redundancy in JSON files full of numbers. OTOH I don’t understand why msgpack would ever be worse.
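
The compressed comparison amounts to something like the sketch below. compressed_pct is a name I made up; it takes two already-encoded byte strings, which in the measurement above would be the output of msgpack.packb(data) and the raw JSON text. To keep the example free of third-party imports, the demo compares compact JSON against pretty-printed JSON on invented data instead.

```python
import json
import zlib

def compressed_pct(candidate, baseline):
    """zlib-compressed size of candidate as a percentage of compressed baseline."""
    return 100.0 * len(zlib.compress(candidate)) / len(zlib.compress(baseline))

# Made-up data for the demo, not one of the files from the sample.
data = {"speeds": [0, 5, 6, 4, 4, 15, 18, 14, 9, 5, 7, 7, 6] * 50}
pretty = json.dumps(data, indent=2).encode("ascii")
compact = json.dumps(data, separators=(",", ":")).encode("ascii")

print("%d%% compact vs pretty, after zlib" % compressed_pct(compact, pretty))
```

zlib.compress() uses its default compression level here; a different level could shift the numbers somewhat.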

I took a quick look at the format spec: I think all msgpack is doing is storing numbers in binary instead of ASCII. That’s an improvement, particularly for codec speed, but it doesn’t seem to make all that big a difference in size. Here’s the msgpack author’s own opinion on comparison to other formats. The most useful link is the StackExchange answer.

What I’m really hoping for is an encoding that divines the structure of the stored data and builds compressed representations based on an ad hoc schema. It’s not crazy; folks were doing this with XML in Java webapps 10 years ago. But it’s not easy either. First step: a lookup dictionary for key names in object maps.
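
A minimal sketch of what that first step might look like, assuming the data is plain JSON (dicts with string keys, lists, scalars). All three function names are hypothetical, and this is only one of several ways to lay out such a table: each distinct key name is stored once and the objects refer to it by a small integer.

```python
def build_key_table(data):
    """Map every distinct object key to a small integer, in first-seen order."""
    table = {}

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                table.setdefault(key, len(table))
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(data)
    return table

def encode_keys(node, table):
    """Replace string keys with their integer index from the table."""
    if isinstance(node, dict):
        return {table[k]: encode_keys(v, table) for k, v in node.items()}
    if isinstance(node, list):
        return [encode_keys(v, table) for v in node]
    return node

def decode_keys(node, names):
    """Inverse of encode_keys; names[i] is the key name for index i."""
    if isinstance(node, dict):
        return {names[i]: decode_keys(v, names) for i, v in node.items()}
    if isinstance(node, list):
        return [decode_keys(v, names) for v in node]
    return node

doc = [{"station": "KSFO", "speed": 12}, {"station": "KOAK", "speed": 9}]
table = build_key_table(doc)
names = list(table)  # index -> key name, since dicts keep insertion order
packed = encode_keys(doc, table)
assert decode_keys(packed, names) == doc
```

The names list would have to be shipped alongside the packed data. The integer keys also mean the packed form is no longer valid JSON, so it would need a binary container that allows non-string map keys.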

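import json, os, sys
import msgpack

pcts = []
for fn in sys.argv[1:]:
    # Skip anything that can't be read or isn't valid JSON
    try:
        with open(fn, "rb") as f:
            jsonData = f.read()
        data = json.loads(jsonData)
    except (OSError, ValueError):
        continue

    # Compare encoded sizes in bytes
    msgData = msgpack.packb(data)
    pct = 100.0 * len(msgData) / len(jsonData)
    print("%d%% %s json %d msgpack %d" % (pct, os.path.basename(fn), len(jsonData), len(msgData)))
    pcts.append(pct)

# Unweighted mean of the per-file percentages
if pcts:
    print("%d%% average size" % (sum(pcts) / len(pcts)))
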
12% ukiah.json json 2025795 msgpack 251349
24% faithful.json json 1103 msgpack 275
37% multiple-child-classes.json json 1645 msgpack 609
40% pets.json json 319 msgpack 128
41% sample_flatpages.json json 762 msgpack 318
42% m2m_through.json json 682 msgpack 292
42% mypeople.json json 142 msgpack 61
42% nk-inheritance.json json 344 msgpack 146
42% non_natural_1.json json 481 msgpack 206
43% big-fixture.json json 1749 msgpack 756
43% fixture6.json json 777 msgpack 341
44% forward_ref_lookup.json json 645 msgpack 286
44% initial_data.json json 5758 msgpack 2558
44% sequence.json json 231 msgpack 102
44% thingy.json json 147 msgpack 66
45% fixture8.json json 651 msgpack 295
46% initial_data.json json 6220 msgpack 2882
46% multidb.default.json json 546 msgpack 253
46% multidb.other.json json 543 msgpack 251
46% testdata.json json 1849 msgpack 860
47% fixture1.json json 744 msgpack 357
47% raw_query_books.json json 2602 msgpack 1240
48% absolute.json json 158 msgpack 76
49% db_fixture_3.nosuchdb.json json 193 msgpack 95
50% authtestdata.json json 1785 msgpack 895
50% empty.json json 2 msgpack 1
50% fixture2.json json 408 msgpack 206
50% multidb-common.json json 196 msgpack 99
50% stations-md.json json 10835 msgpack 5523
50% testdata.json json 1783 msgpack 894
50% testdata.json json 1783 msgpack 894
51% comment_tests.json json 968 msgpack 502
53% db_fixture_1.default.json json 208 msgpack 112
53% fixture1.json json 434 msgpack 231
53% fixture2.json json 437 msgpack 234
53% initial_data.json json 211 msgpack 112
54% flare.json json 11413 msgpack 6220
54% zeppelin.json json 318590 msgpack 174435
55% feeddata.json json 725 msgpack 403
55% initial_data.json json 226 msgpack 126
56% direct_message-destroy.json json 397 msgpack 225
56% direct_messages-new.json json 397 msgpack 225
58% skyrim.json json 22577 msgpack 13124
59% us-borders.json json 1421 msgpack 843
61% show-89512102.json json 656 msgpack 405
61% update.json json 656 msgpack 405
61% user_timeline-kesuke.json json 658 msgpack 406
62% manifest.json json 307 msgpack 193
64% CellLocation-geo.json json 936985 msgpack 608302
64% WifiLocation-geo.json json 5026275 msgpack 3249817
64% sample.json json 34 msgpack 22
66% model-inheritance.json json 161 msgpack 107
66% test-region.json json 104 msgpack 69
67% package.json json 732 msgpack 494
68% miserables.json json 11535 msgpack 7877
69% streets.json json 2479390 msgpack 1717654
69% streets.json json 2479390 msgpack 1717654
69% streets.json json 2479390 msgpack 1717654
69% streets.json json 2479390 msgpack 1717654
69% streets.json json 2479390 msgpack 1717654
69% streets.json json 2479390 msgpack 1717654
69% test-country.json json 23138 msgpack 16046
71% mkassawara-aprs.json json 5644 msgpack 4036
72% cells.json json 2186447 msgpack 1583116
72% steps.json json 98475 msgpack 71556
73% mkassawara-first-solo-xc.json json 2368 msgpack 1730
73% walks.json json 269914 msgpack 199435
74% marimekko.json json 1139 msgpack 847
75% ohio.json json 244951 msgpack 184823
75% stations.json json 2185 msgpack 1648
75% test-location.json json 31890 msgpack 23957
76% bullets.json json 543 msgpack 417
76% status-destroy.json json 473 msgpack 364
76% us-state-centroids.json json 7849 msgpack 6009
79% track.json json 2570 msgpack 2048
79% us-counties.json json 901283 msgpack 718329
80% direct_messages.json json 240 msgpack 192
80% public_timeline.json json 9488 msgpack 7670
81% us-states.json json 88430 msgpack 71888
81% user_timeline.json json 356 msgpack 291
82% initial_data.json json 4563 msgpack 3748
82% turkers.json json 1866 msgpack 1533
83% friends_timeline-kesuke.json json 9045 msgpack 7585
84% KMER-df30.json json 321 msgpack 272
85% friends.json json 17796 msgpack 15233
85% world-countries.json json 252515 msgpack 216432
86% KLVK-df30.json json 322 msgpack 278
86% KRHV-df30.json json 319 msgpack 276
86% followers.json json 29176 msgpack 25124
86% trip.json json 25665 msgpack 22184
87% flare-imports.json json 34321 msgpack 30201
87% friendship-create.json json 468 msgpack 408
87% friendship-destroy.json json 468 msgpack 408
87% replies.json json 3298 msgpack 2895
87% show-dewitt.json json 722 msgpack 630
88% KAPC-df30.json json 322 msgpack 286
88% featured.json json 10056 msgpack 8854
89% KCCR-df30.json json 324 msgpack 289
89% KMRY-df30.json json 327 msgpack 293
89% KPAO-df30.json json 317 msgpack 283
89% KSTS-df30.json json 324 msgpack 291
89% KWVI-df30.json json 330 msgpack 296
90% KOAK-df30.json json 344 msgpack 313
91% KMAE-df30.json json 326 msgpack 299
91% KNUQ-df30.json json 317 msgpack 290
91% KVCB-df30.json json 328 msgpack 301
92% KHWD-df30.json json 327 msgpack 303
92% KMHR-df30.json json 332 msgpack 306
92% KSFO-df30.json json 348 msgpack 321
92% KSJC-df30.json json 334 msgpack 308
92% KSMF-df30.json json 339 msgpack 312
92% KSNS-df30.json json 326 msgpack 300
92% KSQL-df30.json json 317 msgpack 293
93% KMCE-df30.json json 328 msgpack 307
93% KMHR.json json 7656 msgpack 7182
93% KMOD-df30.json json 327 msgpack 307
93% KPAO.json json 7372 msgpack 6891
93% KSAC-df30.json json 333 msgpack 313
93% KSAC.json json 7485 msgpack 7035
93% KSCK-df30.json json 336 msgpack 314
93% KSUU-df30.json json 341 msgpack 320
94% KCCR.json json 7465 msgpack 7034
94% KMOD.json json 7388 msgpack 6953
94% KRHV.json json 7106 msgpack 6714
94% KSFO.json json 7710 msgpack 7269
94% KSQL.json json 7615 msgpack 7199
94% KSUU.json json 7555 msgpack 7160
94% district.json json 53585 msgpack 50817
94% district.json json 53585 msgpack 50817
94% district.json json 53585 msgpack 50817
95% KAPC.json json 7541 msgpack 7220
95% KDVO.json json 7826 msgpack 7467
95% KHWD.json json 7646 msgpack 7329
95% KLVK.json json 7442 msgpack 7100
95% KMAE.json json 7582 msgpack 7231
95% KMCE.json json 7278 msgpack 6959
95% KMER.json json 7258 msgpack 6909
95% KOAK.json json 7662 msgpack 7332
95% KSCK.json json 7591 msgpack 7263
95% KSTS.json json 7548 msgpack 7210
95% KWVI.json json 7711 msgpack 7375
96% KMRY.json json 7760 msgpack 7454
96% KNUQ.json json 7701 msgpack 7433
96% KSMF.json json 7777 msgpack 7512
96% KSNS.json json 7700 msgpack 7426
96% KVCB.json json 7726 msgpack 7417
97% KSJC.json json 7758 msgpack 7567
99% test-usstates.json json 152305 msgpack 152245
99% test-worldborders.json json 911213 msgpack 911121
107% flowers.json json 14123 msgpack 15152
110% unemployment.json json 1383 msgpack 1532
120% world.json json 142051 msgpack 170748
121% unemployment.json json 39718 msgpack 48273
76% average size