msgpack sounds like it could be useful: a binary format alternative to JSON that promises to be “like JSON but it’s faster and smaller”. Is it really? Not dramatically. On a sample of 175 JSON files I had lying around, the msgpack files were on average roughly 75% the size of the JSON files. Better, but not by much. And if you’re a bit careful with JSON formatting and float precision, it’s not clear msgpack has anything to offer on size.
The big winner (ukiah.json, where the msgpack is 12% the size of the original) is a GeoJSON file with a few thousand lat/lon coordinates in it, all pretty-printed with spaces. Some files actually get bigger with msgpack; a quick look suggests those are files with short floats in them like “32.17”, and it may be that the Python encoder is giving those a full 8 bytes for an IEEE double.
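To make the float guess concrete: the msgpack spec's float 64 format is a 0xcb marker byte followed by a big-endian IEEE double, so 9 bytes total. The sketch below hand-encodes that layout with the standard library instead of calling the msgpack module, so it only shows what an encoder would emit if it picks float 64. `pack_float64` is a name made up for illustration.

```python
import json
import struct

def pack_float64(x):
    """Hand-encode x the way the msgpack spec lays out a float 64:
    a 0xcb marker byte followed by a big-endian IEEE 754 double."""
    return b"\xcb" + struct.pack(">d", x)

# Compare the JSON text length with the fixed-size binary form.
for value in (32.17, 0.5, 3.141592653589793):
    text = json.dumps(value)
    packed = pack_float64(value)
    print("%-18s json %2d bytes, float 64 %d bytes" % (text, len(text), len(packed)))
```

Any float that prints in fewer than 9 characters is smaller as JSON text than as a float 64, which would explain files full of short decimals growing.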
For a file I actually care about, my WindHistory stations-md.json database, the 1 meg JSON file goes to 500k. That’s a big win for me in a deployed webapp. I think msgpack may do a particularly good job with the arrays of small numbers that file contains, e.g. [0,5,6,4,4,15,18,14,9,5,7,7,6]. I’d considered doing a custom format for that; dropping in msgpack would sure be easier.
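Here is a rough check of why an array like that shrinks. Per the msgpack spec, an array of up to 15 items gets a one-byte fixarray header (0x90 | length), and each integer from 0 to 127 is a single positive-fixint byte. The sketch hand-encodes just that case with the standard library; `pack_small_ints` is a hypothetical helper, not part of the msgpack module.

```python
import json

def pack_small_ints(values):
    """Hand-encode a short list of small non-negative ints per the msgpack spec:
    a fixarray header (0x90 | length, for up to 15 items) followed by
    one positive-fixint byte (0x00-0x7f) per value."""
    if len(values) > 15 or any(not 0 <= v <= 127 for v in values):
        raise ValueError("sketch only handles up to 15 values in 0..127")
    return bytes([0x90 | len(values)]) + bytes(values)

sample = [0, 5, 6, 4, 4, 15, 18, 14, 9, 5, 7, 7, 6]
as_json = json.dumps(sample, separators=(",", ":"))  # no spaces after commas
as_msgpack = pack_small_ints(sample)
print(len(as_json), len(as_msgpack))  # prints "30 14"
```

That is a bit under half the size for this one array, in line with the roughly 50% I saw on the whole file.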
Update: I did a quick extra comparison, passing the data through zlib.compress() after encoding, before comparing lengths. Compressed msgpack is worse than compressed json! For my sample, the msgpack/gzip data was on average 106% the size of the json/gzip data. For the WindHistory database the msgpack was still smaller, but only a little; 90%. gzip must be good at removing the text redundancy in json files full of numbers. OTOH I don’t understand why msgpack would ever be worse.
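A minimal sketch of that kind of comparison, assuming each encoder is just a function from data to bytes. The `sizes` helper and the sample data are made up for illustration and are not the files I measured; msgpack is only included if the module happens to be installed.

```python
import json
import zlib

def sizes(data, encoders):
    """Return {name: (raw_bytes, zlib_bytes)} for each encoder function."""
    result = {}
    for name, encode in encoders.items():
        raw = encode(data)
        result[name] = (len(raw), len(zlib.compress(raw)))
    return result

encoders = {
    "json": lambda d: json.dumps(d, separators=(",", ":")).encode("utf-8"),
    "json-pretty": lambda d: json.dumps(d, indent=4).encode("utf-8"),
}
try:
    import msgpack  # third-party; only used if installed
    encoders["msgpack"] = msgpack.packb
except ImportError:
    pass

# Made-up sample data, not one of the files measured in this post.
sample = {"station": "KSFO", "speeds": [0, 5, 6, 4, 4, 15, 18, 14, 9, 5, 7, 7, 6] * 50}
for name, (raw, compressed) in sorted(sizes(sample, encoders).items()):
    print("%-12s raw %6d  zlib %6d" % (name, raw, compressed))
```

Highly repetitive sample data like this compresses far better than real files would, so the numbers it prints say nothing about the averages above; it only shows the shape of the measurement.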
I took a quick look at the format spec: I think all msgpack is doing is storing numbers in binary instead of ASCII. That’s an improvement, particularly for codec speed, but it doesn’t seem to make all that big a difference in size. Here’s the msgpack author’s own opinion on comparison to other formats. The most useful link is the StackExchange answer.
What I’m really hoping for is an encoding that divines the structure of the stored data and builds compressed representations based on an ad hoc schema. It’s not crazy; folks were doing this with XML in Java webapps 10 years ago. But it’s not easy either. First step: a lookup dictionary for key names in object maps.
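That first step could look something like the sketch below: walk the data, replace every object key with an index into a shared table, and ship the table alongside the data. This is only one way to do it, not an existing format. `compact_keys` and `expand_keys` are hypothetical names, the records are made up, and the result is measured as JSON (where the integer keys become short strings) just to keep the example to the standard library.

```python
import json

def compact_keys(value, table):
    """Recursively replace dict keys with integer indices into `table`."""
    if isinstance(value, dict):
        out = {}
        for key, child in value.items():
            if key not in table:
                table[key] = len(table)  # next free index
            out[table[key]] = compact_keys(child, table)
        return out
    if isinstance(value, list):
        return [compact_keys(child, table) for child in value]
    return value

def expand_keys(value, names):
    """Inverse of compact_keys; `names` is the list of key names by index."""
    if isinstance(value, dict):
        # int() because JSON turns integer keys into strings.
        return {names[int(k)]: expand_keys(v, names) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_keys(v, names) for v in value]
    return value

# Made-up records with repeated key names.
records = [{"latitude": 37.6, "longitude": -122.4, "elevation": 4},
           {"latitude": 37.7, "longitude": -122.2, "elevation": 2}] * 20

table = {}
compact = compact_keys(records, table)
names = sorted(table, key=table.get)  # key names ordered by index
plain = json.dumps(records, separators=(",", ":"))
encoded = json.dumps([names, compact], separators=(",", ":"))

decoded_names, decoded = json.loads(encoded)
restored = expand_keys(decoded, decoded_names)
print(len(plain), len(encoded), restored == records)
```

The saving depends entirely on how often key names repeat; a file with one big object and no repeated keys would get slightly bigger from the table overhead.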
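# Compare the size of each JSON file named on the command line with its msgpack encoding.
import json, sys, msgpack, os

pcts = []
for fn in sys.argv[1:]:
    try:
        # Read bytes so len() counts bytes on disk, not characters.
        with open(fn, "rb") as f:
            jsonData = f.read()
        data = json.loads(jsonData)
    except Exception:
        # Skip files that can't be read or aren't valid JSON.
        continue
    msgData = msgpack.packb(data)
    pct = 100.0 * len(msgData) / len(jsonData)
    print("%d%% %s json %d msgpack %d" % (pct, os.path.basename(fn), len(jsonData), len(msgData)))
    pcts.append(pct)
print("%d%% average size" % (sum(pcts) / len(pcts)))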
12% ukiah.json json 2025795 msgpack 251349
24% faithful.json json 1103 msgpack 275
37% multiple-child-classes.json json 1645 msgpack 609
40% pets.json json 319 msgpack 128
41% sample_flatpages.json json 762 msgpack 318
42% m2m_through.json json 682 msgpack 292
42% mypeople.json json 142 msgpack 61
42% nk-inheritance.json json 344 msgpack 146
42% non_natural_1.json json 481 msgpack 206
43% big-fixture.json json 1749 msgpack 756
43% fixture6.json json 777 msgpack 341
44% forward_ref_lookup.json json 645 msgpack 286
44% initial_data.json json 5758 msgpack 2558
44% sequence.json json 231 msgpack 102
44% thingy.json json 147 msgpack 66
45% fixture8.json json 651 msgpack 295
46% initial_data.json json 6220 msgpack 2882
46% multidb.default.json json 546 msgpack 253
46% multidb.other.json json 543 msgpack 251
46% testdata.json json 1849 msgpack 860
47% fixture1.json json 744 msgpack 357
47% raw_query_books.json json 2602 msgpack 1240
48% absolute.json json 158 msgpack 76
49% db_fixture_3.nosuchdb.json json 193 msgpack 95
50% authtestdata.json json 1785 msgpack 895
50% empty.json json 2 msgpack 1
50% fixture2.json json 408 msgpack 206
50% multidb-common.json json 196 msgpack 99
50% stations-md.json json 10835 msgpack 5523
50% testdata.json json 1783 msgpack 894
50% testdata.json json 1783 msgpack 894
51% comment_tests.json json 968 msgpack 502
53% db_fixture_1.default.json json 208 msgpack 112
53% fixture1.json json 434 msgpack 231
53% fixture2.json json 437 msgpack 234
53% initial_data.json json 211 msgpack 112
54% flare.json json 11413 msgpack 6220
54% zeppelin.json json 318590 msgpack 174435
55% feeddata.json json 725 msgpack 403
55% initial_data.json json 226 msgpack 126
56% direct_message-destroy.json json 397 msgpack 225
56% direct_messages-new.json json 397 msgpack 225
58% skyrim.json json 22577 msgpack 13124
59% us-borders.json json 1421 msgpack 843
61% show-89512102.json json 656 msgpack 405
61% update.json json 656 msgpack 405
61% user_timeline-kesuke.json json 658 msgpack 406
62% manifest.json json 307 msgpack 193
64% CellLocation-geo.json json 936985 msgpack 608302
64% WifiLocation-geo.json json 5026275 msgpack 3249817
64% sample.json json 34 msgpack 22
66% model-inheritance.json json 161 msgpack 107
66% test-region.json json 104 msgpack 69
67% package.json json 732 msgpack 494
68% miserables.json json 11535 msgpack 7877
69% streets.json json 2479390 msgpack 1717654
69% streets.json json 2479390 msgpack 1717654
69% streets.json json 2479390 msgpack 1717654
69% streets.json json 2479390 msgpack 1717654
69% streets.json json 2479390 msgpack 1717654
69% streets.json json 2479390 msgpack 1717654
69% test-country.json json 23138 msgpack 16046
71% mkassawara-aprs.json json 5644 msgpack 4036
72% cells.json json 2186447 msgpack 1583116
72% steps.json json 98475 msgpack 71556
73% mkassawara-first-solo-xc.json json 2368 msgpack 1730
73% walks.json json 269914 msgpack 199435
74% marimekko.json json 1139 msgpack 847
75% ohio.json json 244951 msgpack 184823
75% stations.json json 2185 msgpack 1648
75% test-location.json json 31890 msgpack 23957
76% average size
76% bullets.json json 543 msgpack 417
76% status-destroy.json json 473 msgpack 364
76% us-state-centroids.json json 7849 msgpack 6009
79% track.json json 2570 msgpack 2048
79% us-counties.json json 901283 msgpack 718329
80% direct_messages.json json 240 msgpack 192
80% public_timeline.json json 9488 msgpack 7670
81% us-states.json json 88430 msgpack 71888
81% user_timeline.json json 356 msgpack 291
82% initial_data.json json 4563 msgpack 3748
82% turkers.json json 1866 msgpack 1533
83% friends_timeline-kesuke.json json 9045 msgpack 7585
84% KMER-df30.json json 321 msgpack 272
85% friends.json json 17796 msgpack 15233
85% world-countries.json json 252515 msgpack 216432
86% KLVK-df30.json json 322 msgpack 278
86% KRHV-df30.json json 319 msgpack 276
86% followers.json json 29176 msgpack 25124
86% trip.json json 25665 msgpack 22184
87% flare-imports.json json 34321 msgpack 30201
87% friendship-create.json json 468 msgpack 408
87% friendship-destroy.json json 468 msgpack 408
87% replies.json json 3298 msgpack 2895
87% show-dewitt.json json 722 msgpack 630
88% KAPC-df30.json json 322 msgpack 286
88% featured.json json 10056 msgpack 8854
89% KCCR-df30.json json 324 msgpack 289
89% KMRY-df30.json json 327 msgpack 293
89% KPAO-df30.json json 317 msgpack 283
89% KSTS-df30.json json 324 msgpack 291
89% KWVI-df30.json json 330 msgpack 296
90% KOAK-df30.json json 344 msgpack 313
91% KMAE-df30.json json 326 msgpack 299
91% KNUQ-df30.json json 317 msgpack 290
91% KVCB-df30.json json 328 msgpack 301
92% KHWD-df30.json json 327 msgpack 303
92% KMHR-df30.json json 332 msgpack 306
92% KSFO-df30.json json 348 msgpack 321
92% KSJC-df30.json json 334 msgpack 308
92% KSMF-df30.json json 339 msgpack 312
92% KSNS-df30.json json 326 msgpack 300
92% KSQL-df30.json json 317 msgpack 293
93% KMCE-df30.json json 328 msgpack 307
93% KMHR.json json 7656 msgpack 7182
93% KMOD-df30.json json 327 msgpack 307
93% KPAO.json json 7372 msgpack 6891
93% KSAC-df30.json json 333 msgpack 313
93% KSAC.json json 7485 msgpack 7035
93% KSCK-df30.json json 336 msgpack 314
93% KSUU-df30.json json 341 msgpack 320
94% KCCR.json json 7465 msgpack 7034
94% KMOD.json json 7388 msgpack 6953
94% KRHV.json json 7106 msgpack 6714
94% KSFO.json json 7710 msgpack 7269
94% KSQL.json json 7615 msgpack 7199
94% KSUU.json json 7555 msgpack 7160
94% district.json json 53585 msgpack 50817
94% district.json json 53585 msgpack 50817
94% district.json json 53585 msgpack 50817
95% KAPC.json json 7541 msgpack 7220
95% KDVO.json json 7826 msgpack 7467
95% KHWD.json json 7646 msgpack 7329
95% KLVK.json json 7442 msgpack 7100
95% KMAE.json json 7582 msgpack 7231
95% KMCE.json json 7278 msgpack 6959
95% KMER.json json 7258 msgpack 6909
95% KOAK.json json 7662 msgpack 7332
95% KSCK.json json 7591 msgpack 7263
95% KSTS.json json 7548 msgpack 7210
95% KWVI.json json 7711 msgpack 7375
96% KMRY.json json 7760 msgpack 7454
96% KNUQ.json json 7701 msgpack 7433
96% KSMF.json json 7777 msgpack 7512
96% KSNS.json json 7700 msgpack 7426
96% KVCB.json json 7726 msgpack 7417
97% KSJC.json json 7758 msgpack 7567
99% test-usstates.json json 152305 msgpack 152245
99% test-worldborders.json json 911213 msgpack 911121
107% flowers.json json 14123 msgpack 15152
110% unemployment.json json 1383 msgpack 1532
120% world.json json 142051 msgpack 170748
121% unemployment.json json 39718 msgpack 48273
76% average size