Comparing Mongo and Postgres for wind calculation

I’m playing around with my weather data, CSV files containing things like “KSQL 1136073600 270 10”. (Station name, timestamp, wind direction, wind speed). Each row is 10 bytes of honest data and I have oh, 300 million rows.

I would like a sophisticated query language for doing aggregate stats, like “show me a histogram of directions for winds > 5kts on Tuesday mornings”. I have been hacking reports up by hand in Python by loading CSV files into lists and doing lots of dict and slices, which is just awful.

Insert

I tried Mongo first because it was new and fun, but I think its query facilities are too limited to be useful to me. More on that in a bit. But it’s sure fast for inserts: 30,000 rows a second, 47 bytes / row on disk. It’s also fun to work with a schema-free database, it always takes me 20 minutes to get a SQL create table right.

Postgres is slower on inserts, about 10,000 rows a second and 52 bytes / row. (The Postgres FAQ says there’s about 28 bytes of overhead per row.) This is with naive Python executemany() insert; a bulk CSV import would be faster.

But I don’t much care about speed, and less about scalability. This is all one off calculations I’m running by myself. And I’m not equipped to do proper benchmarking anyway.

One caveat: I’m doing all of this measurement without indices. Well, not entirely, Mongo gets a sort of implicit index because I’m storing each station in a separate collection. In the real world of course you’d want an index on station, maybe also on aspects of the timestamp.

Queries

Mongo has very limited query support. It has a basic query predicate language that’s awkward and verbose, no date and time calculations, and precious little aggregation capability like SQL’s AVERAGE() operator. You can write all your queries in Javascript to run on the server. And Mongo has a nifty mapreduce capability for big processing. Only mapreduce doesn’t actually run in parallel in a single Mongo instance, so there’s not much benefit to that computing paradigm unless you have a sharded database.

The big drawback with Mongo is a lot of code to do something that’d be a one liner in SQL. For instance, here’s getting the average wind speed at a station.

# select average(s) from KSFO
# This map reduce version is slow, about 1s on 50,000 rows. It's single threaded :-(
# http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-Parallelism

# Map function: turn a document into the value (1, speed)
# The key for this map is a constant, the number 1: we're summing over the whole collection
mapFunction = bson.Code("""
function() {
  emit(1, { count: 1, sum: this.s });
}""")

# Reduce function: sum up the counts and sums
# Note this can be called on its own output, so have to match types from map
reduceFunction = bson.Code("""
function(key, values) {
    var result = {count: 0, sum: 0};

    values.forEach(function(value) {
      result.count += value.count;
      result.sum += value.sum;
    });

    return result;
}""")

# Execute the query and grab the results. Up to us to do the average calculation.
result = coll.map_reduce(mapFunction, reduceFunction, "myresults")
total = result.find_one()["value"]["sum"]
count = result.find_one()["value"]["count"]

print "Average speed: %.1fkts (%d samples)" % ((total / count), count)

That’s a lot of code just to do an average! Code I have to test, and maintain. It’s not particularly fast, either, 1000ms.

By contrast Postgres really is just one line of code for averages select avg(speed) from winds where station = 'KPAO'. And SQL has great complex date math, stuff I’d have to write myself for Mongo. Postgres is faster on this query, too: 600ms to do the average and that’s on a full table scan of millions of rows without even an index on the station. (It goes to 5ms with the index).

Conclusion

I’m not trying to say Postgres good Mongo bad, not at all. But Mongo’s not a good match for what I’m doing. I don’t have terabytes of data I need to shard across multiple servers. I don’t need sophisticated schema agility. I don’t really need relations, either. All I really need is efficient storage and a fast, convenient query language. Postgres and SQL are pretty good at that.