Nelson's log

Simple app metrics: stathat

I’m looking for some simple application metrics. I want to track some basic things like “time it takes to make an API call” and maybe “number of interesting events over time”. The big boys a while back converged on statsd + graphite for this, or now Grafana, and those are great technologies. They’re also too complicated for my little project. I just want something simple that will work. And hopefully nicer than rrdtool+mrtg :-/

Ian D recommended StatHat so I gave it a try. It’s pretty nice. You make REST HTTP calls giving it data and it aggregates the data and makes reports. It’s pretty basic, but then it’s simple to use and requires no extra work on my part.

The biggest drawback is that it requires an HTTP call in your app, which tends to be synchronous. The supplied client libraries don’t even have timeouts! I think this can be worked around with some async or thread implementation, or maybe even an offline stats collector, but there was nothing off the shelf.
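One way the thread workaround could look, as a minimal sketch rather than anything StatHat ships: a queue plus a worker thread, so the app never waits on the HTTP call. The endpoint URL and the form fields (ezkey, stat, value) are assumptions about the EZ API and should be checked against the StatHat docs; BackgroundReporter and post_value are hypothetical names.

```python
import queue
import threading
import urllib.parse
import urllib.request

# Assumed endpoint for StatHat's single-value EZ API; verify against the docs.
EZ_URL = "https://api.stathat.com/ez"


def post_value(ezkey, stat, value, timeout=5.0):
    """Blocking POST of one value, with an explicit timeout."""
    body = urllib.parse.urlencode(
        {"ezkey": ezkey, "stat": stat, "value": value}
    ).encode("ascii")
    with urllib.request.urlopen(EZ_URL, data=body, timeout=timeout) as resp:
        return resp.status


class BackgroundReporter:
    """Queues stats and sends them from a worker thread (hypothetical helper)."""

    def __init__(self, send, maxsize=1000):
        self._send = send  # callable taking (stat, value)
        self._queue = queue.Queue(maxsize=maxsize)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def report(self, stat, value):
        """Never blocks the caller; drops the stat if the queue is full."""
        try:
            self._queue.put_nowait((stat, value))
            return True
        except queue.Full:
            return False

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                if item is None:  # sentinel from close()
                    return
                try:
                    self._send(*item)
                except Exception:
                    pass  # metrics are best effort; never crash the app
            finally:
                self._queue.task_done()

    def close(self):
        """Send whatever is still queued, then stop the worker."""
        self._queue.put(None)
        self._worker.join()
```

Usage would be something like reporter = BackgroundReporter(lambda s, v: post_value("my-ezkey", s, v)), then reporter.report("api call ms", 320) wherever a measurement happens. Dropping stats when the queue is full is a deliberate choice here: losing a data point seems better than stalling the app.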

I also found posting one stat value at a time with an HTTP call is very slow. I was able to post approximately 8 records a second. And I think that’s using HTTP keepalives! Fortunately there’s a bulk import API using JSON that is much faster; about 1000x faster if I insert 1000 records at a time.
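A rough sketch of batching records for the bulk API, 1000 per request. The JSON shape (an "ezkey" plus a "data" list of objects with "stat", "value" and "t") is an assumption about StatHat's JSON format, and bulk_bodies is a hypothetical helper; it only builds the request bodies and leaves the actual POST to the caller.

```python
import json


def chunked(records, size=1000):
    """Yield successive slices of at most `size` records."""
    for i in range(0, len(records), size):
        yield records[i:i + size]


def bulk_bodies(ezkey, records, size=1000):
    """Yield one JSON request body per batch of (stat, value, timestamp) tuples.

    The field names below are assumed, not confirmed by the post.
    """
    for batch in chunked(records, size):
        data = [{"stat": s, "value": v, "t": t} for s, v, t in batch]
        yield json.dumps({"ezkey": ezkey, "data": data})
```

With 2500 records this produces three bodies (1000, 1000 and 500 records), so three HTTP calls instead of 2500.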

Data records have an optional timestamp. It has to be an integer number of seconds. If you pass in a floating-point value with milliseconds like 1460076429.905, it will be silently ignored and the data will be timestamped with the current clock time instead.
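A small guard against the float timestamp trap: truncate to whole seconds before building a record. epoch_seconds is a hypothetical helper name, not part of any StatHat library.

```python
import time


def epoch_seconds(ts=None):
    """Return a whole-second Unix timestamp, dropping any fractional part.

    With no argument, uses the current time.
    """
    if ts is None:
        ts = time.time()
    return int(ts)
```

For example, epoch_seconds(1460076429.905) gives 1460076429, which is the form the API accepts.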

The biggest problem I’ve run into is that my data is getting lost. It looks just fine when I import it, then an hour or two later my data is gone. I’ve written asking if this is something I’m doing wrong. Update: my data reappeared. There are some confusing things that happen around updates, etc.; maybe the consistency is just eventual.

Here’s a sample report. The number being measured is the milliseconds it takes for Riot’s Web API to answer a request. This tells me the average time is 320ms and it spiked around 9AM. The reported min, max, 95% and 99% don’t seem correct to me, but maybe those are describing the averages in 15 minute windows instead of individual events (which would make them less useful).
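To illustrate why a max computed over 15 minute averages would look wrong next to the raw events, here is a toy comparison. The numbers are made up for illustration, and whether StatHat actually aggregates this way is only the guess stated above.

```python
from collections import defaultdict
from statistics import mean

WINDOW = 15 * 60  # window length in seconds


def window_averages(events):
    """events: iterable of (epoch_seconds, latency_ms) pairs.

    Returns the average latency of each 15 minute window, in time order.
    """
    buckets = defaultdict(list)
    for ts, ms in events:
        buckets[ts // WINDOW].append(ms)
    return [mean(buckets[k]) for k in sorted(buckets)]


# Made-up sample: one slow 2000ms call among otherwise ordinary ones.
events = [(0, 300), (60, 310), (120, 2000), (900, 320), (960, 330)]

raw_max = max(ms for _, ms in events)    # max over individual events
avg_max = max(window_averages(events))   # max over per-window averages
```

Here raw_max is 2000 but avg_max is only 870, because the slow call is averaged with two fast ones in its window. The same smoothing would pull the reported 95% and 99% figures toward the mean.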

It’s a neat product though. I may stick with it if I can figure out the data loss problem. Sure beats setting up some complicated stuff myself.

Update: I’m using stathat in production now and am very happy with it. I just collect all the stats I care about in memory during a run of my job, then post them all at the end with a 10 second HTTP timeout. Seems to work just fine. The disappearing data problem seems to have been a temporary quirk of the initial bulk import and hasn’t been an issue since. And it’s really nice having the charts!
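The collect-then-post pattern might look roughly like this. It is a sketch, not the actual production code: StatCollector is a hypothetical class, and the URL and JSON field names are assumptions about StatHat's bulk JSON API.

```python
import json
import time
import urllib.request

# Assumed bulk endpoint; verify against the StatHat docs.
BULK_URL = "https://api.stathat.com/ez"


class StatCollector:
    """Accumulates stats in memory and posts them in one request (hypothetical)."""

    def __init__(self, ezkey):
        self.ezkey = ezkey
        self.records = []

    def value(self, stat, value, ts=None):
        """Record one measurement with a whole-second timestamp."""
        t = int(ts if ts is not None else time.time())
        self.records.append({"stat": stat, "value": value, "t": t})

    def body(self):
        """JSON request body for everything collected so far."""
        payload = {"ezkey": self.ezkey, "data": self.records}
        return json.dumps(payload).encode("utf-8")

    def flush(self, timeout=10.0):
        """POST all collected stats once, giving up after `timeout` seconds."""
        if not self.records:
            return None
        req = urllib.request.Request(
            BULK_URL,
            data=self.body(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            status = resp.status
        self.records = []  # only cleared after a successful POST
        return status
```

A job would call stats.value("api call ms", elapsed_ms) as it runs and stats.flush() once at the end, probably inside a try/except so a StatHat outage can't fail the job itself.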