Trying out Riak on MacOS with Homebrew

Homebrew install notes

I decided to try out Riak on MacOS, using Homebrew to install. Here are some rough notes.

“brew install riak” worked. However, it doesn’t leave behind the Makefile you need to run “make devrel” to set up the 3 node test cluster. So I interrupted Homebrew mid-install, copied the patched sources, ran “make all rel devrel” in my own source tree, then copied the “dev/dev?” directories out, still using the Homebrew binaries. A lot of work for 4 lousy config files :-P

The Python client you want is riak-python from Basho. Unfortunately “pip install riak” doesn’t quite work, because pip first tries to install protobuf, and there’s a three year old bug open on Google’s protobuf client about how it doesn’t install right in Python. Some helpful soul forked Google’s library, and his version can be installed with pip install “git+http://github.com/rem/python-protobuf.git@v2.4.1#egg=protobuf”

With that you have a working Riak and Python environment that you can use to work with the tutorials online.
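As a quick sanity check, here’s a minimal smoke test sketch using the legacy riak-python client API. The bucket name, key, and data are made up for illustration; dev1’s HTTP port is 8091 in the devrel setup, so adjust if yours differs.

import riak

# Connect over HTTP to one of the dev nodes (dev1 defaults to 8091).
client = riak.RiakClient(port=8091)

# Store and fetch a throwaway object to confirm the stack works end to end.
bucket = client.bucket('smoke-test')
bucket.new('hello', data={'works': True}).store()
print bucket.get('hello').get_data()   # {'works': True}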

Insert notes

Trying to insert my weather data: millions of rows of 4-tuples: (station name, timestamp, speed, direction). Testing with 3 Riak nodes running on my single fast iMac. This data is not a good match for document-oriented stores: I already learned that Mongo was not good for it, and I expect Riak won’t be either. Still, it’s real data I have and I want to try it.

Wow, is Riak slow if you just use it as a naive user. With the 3 node cluster running on a fast iMac, inserting 45,000 records one at a time with a store() between each took 120 seconds. That’s about 375 records / second.

Some write optimization... there appears to be no batch insert in the Python library I’m using (although see this). Riak’s exposure of consistency via the W parameter is really neat, but what I need is to commit 10,000 records at once. I’ve tried optimizing my writes (i.e. windData.store(w=0, dw=0, return_body=False)). Switching from HTTP to Protocol Buffers gives about a 4x speedup. Still, no matter what I do, I can’t do better than 1,900 records / second. Compare Mongo’s 30,000 records / second or Postgres’ 10,000 rows / second. Am I doing something terribly wrong?
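For concreteness, here’s a sketch of the kind of insert loop I mean, with the write knobs from above. The key choice and record layout are my own guesses for illustration, not gospel; the PBC port is dev1’s default.

import riak

# Protocol Buffers transport: about 4x faster than HTTP for these inserts.
client = riak.RiakClient(port=8081, transport_class=riak.RiakPbcTransport)
bucket = client.bucket('KSFO')

# Sample (timestamp, speed, direction) rows for one station.
records = [(1325376000, 12.5, 270.0), (1325376300, 8.0, 255.0)]

for timestamp, speed, direction in records:
    obj = bucket.new(str(timestamp), data=[timestamp, speed, direction])
    # w=0, dw=0: don't wait for write or durable-write acknowledgments.
    # return_body=False: don't fetch the stored object back.
    obj.store(w=0, dw=0, return_body=False)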

Sadly, I get a 10% speed increase (from 1900 to 2100 records / second) if I only have one node in the cluster. Not useful in the real world, but interesting.

With just a single node running, Riak seems to use about 1 kilobyte / document (650,000 rows, 720 megs on disk). Not good, since the docs themselves are like 20 bytes of data. Compare about 50 bytes / row for both Postgres and Mongo. With 3 nodes the combined disk usage is about the same size; I’m surprised, since I thought the default config kept three copies of all data.

I sure miss the ability to drop a whole bucket. I don’t really have a way to reset my database while testing.
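The closest thing to a reset I can think of is listing every key and deleting the objects one by one, if I have the client API right. A sketch; note it leans on the same expensive key-listing operation the docs warn about:

import riak

client = riak.RiakClient(port=8081, transport_class=riak.RiakPbcTransport)
bucket = client.bucket('KSFO')

# No "drop bucket" in Riak: enumerate the keys and delete each object.
# get_keys() walks the entire keyspace, so this is slow and not
# something to do in production.
for key in bucket.get_keys():
    bucket.get(key).delete()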

Query Notes

Once again I petulantly note that raw map/reduce is awfully low level if you just want a damn SQL function. Of course it gets wonderful when you want to query using 100 machines and a distributed data store.

Trying to list 50,000 keys in a bucket (via /keys?keys=true) kills both Chrome and curl: they complain that the HTTP header returned is too big. Also the docs warn you never to do this in production. In other words, queries over the whole dataset are difficult. I think you need to use links or some extra indexing system to do queries. Riak has a search capability, btw, but it looks more oriented toward full-text search than simple data iteration.

I wrote some simple mapreduce code to do “select count(*) from bucket” and “select count(*) from bucket where parameter > 30”. It takes 16 seconds to run over 60,000 rows, which is just unacceptable. Also, the answer it gives is wrong: 43,561 instead of 59,072. Huh. While it’s cool that I could run this query while nodes were dropping out, it’s uncool that the answer it gave varied as data was moved around.

I don’t think I have a bug in my query code, but you never know… Here it is.

import riak

client = riak.RiakClient(port=8081, transport_class=riak.RiakPbcTransport)

# select count(*) from KSFO
query = client.add('KSFO')
query.map("function(v) { return [1]; }")  # map: emit a 1 for every object
query.reduce("Riak.reduceSum")            # reduce: sum the 1s into a count
for result in query.run(): print result

# select count(*) from KSFO where data[1] > 20
query = client.add('KSFO')
query.map('''function(v) {
    var data = JSON.parse(v.values[0].data);
    if (data[1] > 20) {
        return [1];
    } else {
        return [];
    }}''')
query.reduce("Riak.reduceSum")
for result in query.run(): print result

Evaluation

I knew this data was a poor match for Riak, and now I’m doubly sure. The 1k / row overhead is the real problem. I don’t think Riak is bad or anything; quite the opposite. I was just curious how it feels to try it out on some data.

I really admire how Riak is distributed to the core. Most datastores are designed for a single server, with replication and partitioning bolted on later; it’s a total mess. Riak started out distributed, and if they did it right it should be significantly more robust for real network usage, and for parallel data analysis too. I also think it’s freaking amazing that you can do “riak-admin join” and “riak-admin leave” and the data transparently repartitions while the cluster stays up and keeps serving write requests.

I wonder how robust Riak is at recovering when a node drops offline without a clean shutdown, then rejoins later with inconsistent data. The docs on CAP mention some voting and validation protocols, so they’ve thought about it. More a question of convenience than usability.