Simple app metrics: stathat

I’m looking for some simple application metrics. I want to track some basic things like “time it takes to make an API call” and maybe “number of interesting events over time”. The big boys converged a while back on statsd + graphite for this, or now Grafana, and those are great technologies. They’re also too complicated for my little project. I just want something simple that will work. And hopefully nicer than rrdtool+mrtg :-/

Ian D recommended StatHat so I gave it a try. It’s pretty nice. You make REST HTTP calls giving it data and it aggregates the data and makes reports. It’s pretty basic, but then it’s simple to use and requires no extra work on my part.

The biggest drawback is that it requires an HTTP call in your app, which tends to be synchronous. The supplied client libraries don’t even have timeouts! I think this can be worked around with some async or thread implementation, or maybe even an offline stats collector, but there was nothing off the shelf.
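One way the thread-based workaround could look, as a rough sketch rather than anything StatHat ships: a queue plus a worker thread, with a timeout on every request. The endpoint URL and the form fields (`ezkey`, `stat`, `value`) are assumptions about StatHat's EZ API that should be checked against its docs, and `AsyncStatPoster` is a made-up name.

```python
import queue
import threading
import urllib.parse
import urllib.request

# Assumption: StatHat's "EZ" endpoint and form fields; check the current docs.
STATHAT_URL = "https://api.stathat.com/ez"


def post_value(ezkey, stat, value, timeout=5.0):
    """Send one value to StatHat, giving up after `timeout` seconds."""
    body = urllib.parse.urlencode(
        {"ezkey": ezkey, "stat": stat, "value": value}
    ).encode("ascii")
    with urllib.request.urlopen(STATHAT_URL, data=body, timeout=timeout) as resp:
        return resp.status


class AsyncStatPoster:
    """Hypothetical helper: queue stats and post them from a worker thread."""

    def __init__(self, ezkey, send=post_value):
        self.ezkey = ezkey
        self.send = send  # injectable so it can be tested without the network
        self.pending = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def record(self, stat, value):
        """Return immediately; the worker thread does the HTTP call."""
        self.pending.put((stat, value))

    def close(self):
        """Flush whatever is queued and stop the worker."""
        self.pending.put(None)
        self.worker.join()

    def _run(self):
        while True:
            item = self.pending.get()
            if item is None:
                return
            stat, value = item
            try:
                self.send(self.ezkey, stat, value)
            except OSError:
                pass  # metrics are best effort; never crash the app over them
```

The app calls `record()` wherever it measures something and `close()` once at shutdown, so a slow or unreachable stats server only ever stalls the worker thread.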

I also found posting one stat value at a time with an HTTP call is very slow. I was able to post approximately 8 records a second. And I think that’s using HTTP keepalives! Fortunately there’s a bulk import API using JSON that is much faster; roughly 1000x faster if I insert 1000 records at a time.
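A minimal sketch of batching records for the bulk API, 1000 per request. The JSON shape (an `ezkey` plus a `data` list of `stat`/`value` objects) and the endpoint are assumptions from memory of StatHat's docs, and the function names are made up for illustration.

```python
import json
import urllib.request

# Assumption: the bulk format is a JSON POST to the EZ endpoint with an
# "ezkey" and a "data" list; verify against StatHat's docs.
STATHAT_URL = "https://api.stathat.com/ez"


def bulk_payloads(ezkey, records, batch_size=1000):
    """Yield JSON request bodies, each carrying up to `batch_size` records.

    `records` is a list of (stat_name, value) pairs.
    """
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        data = [{"stat": name, "value": value} for name, value in batch]
        yield json.dumps({"ezkey": ezkey, "data": data}).encode("utf-8")


def post_bulk(ezkey, records, timeout=10.0):
    """POST every batch: one HTTP round trip per batch instead of per record."""
    for body in bulk_payloads(ezkey, records):
        req = urllib.request.Request(
            STATHAT_URL, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            resp.read()
```

The speedup comes almost entirely from paying the HTTP round trip once per batch rather than once per value.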

Data records have an optional timestamp. Those have to be an integer. If you pass in a floating point with milliseconds like 1460076429.905 it will be silently ignored and the data will be timestamped with the current clock time instead.
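The safe thing is to truncate to whole seconds before sending. A tiny sketch; the helper name is made up, and the `t` field name for the timestamp is an assumption about the API.

```python
import time


def stathat_timestamp(when=None):
    """Return whole seconds since the epoch, dropping any fractional part."""
    if when is None:
        when = time.time()
    return int(when)


# "t" as the timestamp field is an assumption; check the API docs.
record = {"stat": "api time", "value": 320, "t": stathat_timestamp(1460076429.905)}
```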

The biggest problem I’ve run into is that my data is getting lost. It looks just fine when I import it, then an hour or two later my data is gone. I’ve written asking if this is something I’m doing wrong. Update: my data reappeared. There are some confusing things that happen around updates, etc.; maybe the consistency is just eventual.

Here’s a sample report. The number being measured is the milliseconds it takes for Riot’s Web API to answer a request. This tells me the average time is 320ms and it spiked around 9AM. The reported min, max, 95% and 99% don’t seem correct to me, but maybe those are describing the averages in 15 minute windows instead of individual events (which would make them less useful).
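If that guess about 15 minute windows is right, here is why it matters. This toy example uses made-up latencies, not my real data: a single slow request nearly vanishes once each window is averaged first.

```python
import statistics

# Made-up latencies in milliseconds for two 15 minute windows.
windows = [
    [300, 310, 290, 300],   # a quiet window
    [300, 320, 310, 1500],  # a window with one slow request
]

events = [ms for window in windows for ms in window]
window_averages = [statistics.mean(window) for window in windows]

print(max(events))           # slowest individual request
print(max(window_averages))  # slowest window average, much lower
```

The max over individual events is the 1500ms outlier, while the max over window averages is only 607.5ms, so the summary statistics of averaged windows hide exactly the slow requests I care about.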

It’s a neat product though. I may stick with it if I can figure out the data loss problem. Sure beats setting up some complicated stuff myself.

[Screenshot: sample StatHat report, 2016-04-08]

Update: I’m using stathat in production now and am very happy with it. I just collect all the stats I care about in memory during a run of my job, then post them all at the end with a 10 second HTTP timeout. Seems to work just fine. The disappearing data problem seems to be only a temporary quirk of the initial bulk import; it hasn’t been an issue since. And it’s really nice having the charts!
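Roughly what that looks like, as a sketch rather than my exact code. `RunStats` is a made-up name, and the endpoint and JSON fields (`ezkey`, `data`, `stat`, `value`, `t`) are assumptions about StatHat's bulk API.

```python
import json
import time
import urllib.request
from contextlib import contextmanager

STATHAT_URL = "https://api.stathat.com/ez"  # assumption, as is the JSON shape


class RunStats:
    """Hypothetical collector: hold stats in memory, post once at the end."""

    def __init__(self, ezkey):
        self.ezkey = ezkey
        self.data = []

    def value(self, stat, value):
        # Integer timestamp so each record keeps the time it was measured.
        self.data.append({"stat": stat, "value": value, "t": int(time.time())})

    @contextmanager
    def timer(self, stat):
        """Record how long the enclosed block took, in milliseconds."""
        start = time.monotonic()
        try:
            yield
        finally:
            self.value(stat, (time.monotonic() - start) * 1000.0)

    def flush(self, timeout=10.0):
        """Post everything in one request; swallow network errors."""
        if not self.data:
            return
        body = json.dumps({"ezkey": self.ezkey, "data": self.data}).encode("utf-8")
        req = urllib.request.Request(
            STATHAT_URL, data=body, headers={"Content-Type": "application/json"}
        )
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                resp.read()
        except OSError:
            pass  # losing a run's stats is better than failing the job
        self.data = []
```

During the job, each API call is wrapped in `with stats.timer("api time"):` and `stats.flush()` runs once at the very end, so the job makes at most one HTTP call to the stats service and never waits on it for more than 10 seconds.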

4 thoughts on “Simple app metrics: stathat”

  1. Oh, New Relic is the gold standard for this sort of app monitoring; everything I hear about it is great. But it’s significant overkill for my needs. Also it’s quite expensive for a hobby project and, if I understand the confusing website, the free level isn’t much good.

    Really I’d prefer to self-host this rather than use an external service. Just looking for a simple self-hosted option. Munin is the right level of complexity but it’s awfully old and ugly and doesn’t provide some of the fancier stats I’d like to see. OTOH it’s working fine for me now, maybe I should just go with it.

  2. About the need for http calls in your app to record analytics events – you might find just writing to stderr/stdout keeps things very straightforward and testable/debuggable, while giving you lots of very reliable tools to handle everything else in a decoupled way. Logfiles? Redirect the output. Log management? logrotate. Needs to run long-term? Everything from nohup through screen to daemontools and runit works nicely. Metrics? Tail the log, call your SaaS provider, stuff into your favourite self-hosted solution, etc. It lets you make Unix deal with buffering, asynchronicity, and concurrency, which, unlike Python, it’s actually good at.

    Conversely (drifting a bit off-topic), it’s bad at locking, atomicity, transactionality, idempotence so let your db do that. If your components behave transactionally, it’s possible to replace cron (when your scheduling is at the minutes rather than ‘3 am every second Tuesday’ level) and lockrun with ‘sleep’ and ‘periodically kill -9 all the things’.

  3. Yeah I generally embrace The Unix Way when doing my projects, and agree with what it’s good at and bad at. In this case I don’t really want the overhead of another text logfile for recording ephemeral timing statistics; I’d rather just ship the numbers off to a graphing component and be done with it.
