I’m playing with using MongoDB to store a bunch of wind time series data for my wind map. First time trying it out. Some quick notes:
I have a bunch of data that’s 4-tuples like this:
“KSQL”, 2011-09-23 11:47:00, 200, 7
That says that at KSQL at a specific time the wind was 7kts from 200 degrees. In an ideal encoding each row would take 10 bytes. In a comfortable encoding it’ll take 20. I’ll settle for 50 bytes per row in the unindexed database. MongoDB is taking 67 bytes per object. Or more!
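To make the "ideal encoding" concrete, here's a sketch of what a 10-byte row could look like, using Python's `struct`. The field widths and the station lookup table are my assumptions, not anything MongoDB does:

```python
import struct
from datetime import datetime, timezone

# One reading: station as a 2-byte index into a station table, time as a
# 4-byte unix timestamp, direction and speed as 2-byte ints.
# "<HIHH" = 2 + 4 + 2 + 2 = 10 bytes per row, no padding.
ROW = struct.Struct("<HIHH")

station_ids = {"KSQL": 0}  # hypothetical lookup table of station names

t = int(datetime(2011, 9, 23, 11, 47, tzinfo=timezone.utc).timestamp())
row = ROW.pack(station_ids["KSQL"], t, 200, 7)
print(len(row))  # 10 bytes for the KSQL example row
```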
Sample input: KDTW and KMDS. 116,000 rows total, 2 megs of CSV, 400k of csv.gz. After import, Mongo is reporting a 7.8 meg data size, or 68 bytes / object. That’s with very short element names (“d”, “s”, etc). If I use long element names (“direction”, “speed”) it balloons to 87 bytes / object. And yes, the extra 19 bytes are exactly the number of characters I added to my element names.
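The per-object size follows straight from the BSON wire format: every document carries a 4-byte length and a 1-byte terminator, and every element carries a type byte plus its field name as a NUL-terminated string. Here's a back-of-the-envelope calculation in pure Python; the exact field names (“n”, “t”, etc.) are my guess at the schema, not taken from my importer:

```python
def bson_element_size(name: str, value_size: int) -> int:
    # Each BSON element: 1 type byte + field name as cstring + the value.
    return 1 + len(name) + 1 + value_size

def bson_doc_size(fields) -> int:
    # Document: 4-byte length prefix + elements + 1-byte terminator.
    return 4 + sum(bson_element_size(n, v) for n, v in fields) + 1

# Assumed schema: _id ObjectId (12 bytes), station string "KSQL"
# (4-byte length + chars + NUL = 9), UTC datetime (8 bytes), two int32s.
short = [("_id", 12), ("n", 9), ("t", 8), ("d", 4), ("s", 4)]
print(bson_doc_size(short))  # 59 bytes before storage-level record overhead

# Spelling the names out costs exactly the extra characters, nothing else:
long = [("_id", 12), ("station", 9), ("time", 8), ("direction", 4), ("speed", 4)]
print(bson_doc_size(long) - bson_doc_size(short))
```

That gets you to 59 bytes of BSON per document with short names; the remaining handful of bytes up to the observed 68 is presumably per-record bookkeeping in the storage engine.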
Lame! Apparently Mongo compresses nothing; it doesn’t even try to use a lookup table to avoid storing a zillion copies of element names. (See this user request to tokenize field names.) That seems very bizarre: not only would a simple table save space, wouldn’t it also save time by making the memory working set smaller?
I don’t have any need for queries across multiple station names. So instead of having one giant table (er, “collection”) for all my data, I’m going to make one collection per station. That’ll give me some 5000 collections. I’m still at 47 bytes per row, which I said was acceptable but seems awfully wasteful. I’m beginning to wonder if MongoDB is a good match for this data. All those ObjectIds aren’t doing me any good, that’s for sure.
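Once the station name lives in the collection name rather than in each document, each row really only needs a timestamp, a direction, and a speed. A hedged sketch of what that could buy, packing one station's readings into a flat binary blob (a hypothetical alternative layout, not something MongoDB offers):

```python
import struct

# 4-byte unix time + 2-byte direction + 2-byte speed = 8 bytes per reading.
READING = struct.Struct("<IHH")

def pack_station(readings):
    # readings: iterable of (unix_time, direction_deg, speed_kts) tuples.
    # The station name is stored once (in the collection/blob name), and
    # there is no per-row ObjectId at all.
    return b"".join(READING.pack(*r) for r in readings)

blob = pack_station([(1316778420, 200, 7), (1316778720, 210, 8)])
print(len(blob))  # 16 bytes for two readings, 8 per row
```

That's 8 bytes per row against Mongo's 47, which is roughly the gap the ObjectIds and per-document framing are costing.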