Data storage for Logs of Lag

My little Logs of Lag project is doing pretty well, so I need to create some sort of more serious datastore for it. It’s a small service, at most 1000 logfiles uploaded a day (each about 20k), but over time that adds up. Right now my data store is “put a file with a random name in a directory”, and I have no database or analysis of all the files.

The high-end solution would be to start putting data files in AWS and create a local Postgres database for aggregate statistics. I know how to build this. But that adds two big moving parts, one of which I have to maintain myself. And it’s kind of overkill for the relatively small amount of traffic I have (and am likely to ever have).

So now I’m thinking I’ll just keep storing the files themselves on the filesystem, but spread them out over a bunch of directories so I don’t have a single giant directory. The Linux kernel no longer chokes on directories with lots of files, but shell tools are still a PITA. File names are base64 encoded things, so maybe bucket by the first two characters, for 64 × 64 = 4096 directories. That’ll carry me to about 16M log files (4096 files per directory), good enough. The file system only has 60M inodes available anyway. (Unfortunately “-” is a valid leading character in the filenames. Oops.)
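The bucketing itself is trivial; something like this sketch is what I have in mind (the storage root and function names are made up, and it could just as easily live in the Python CGI):

    // Put each log file in a subdirectory named for the first two
    // characters of its base64 filename: 64 * 64 = 4096 buckets.
    // LOG_ROOT and bucketPath are invented names for illustration.
    var fs = require('fs');
    var path = require('path');

    var LOG_ROOT = '/var/data/logsoflag';   // hypothetical storage root

    function bucketPath(filename) {
        var bucket = filename.slice(0, 2);
        return path.join(LOG_ROOT, bucket, filename);
    }

    function storeLog(filename, contents) {
        var dest = bucketPath(filename);
        var dir = path.dirname(dest);
        if (!fs.existsSync(dir)) {
            fs.mkdirSync(dir);   // only one level deep, no recursion needed
        }
        fs.writeFileSync(dest, contents);
    }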

I definitely want a database; I’m curious about aggregate stats. I’m thinking of keeping this all asynchronous: when a user uploads a log it doesn’t go straight to the database, it just gets written to a file. Then a cron job picks up the files and does the postprocessing. That works fine as long as the user-facing application doesn’t need the database data, and it doesn’t right now; in fact the server isn’t needed at all, the user gets 95% of the value solely from client-side JavaScript.
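The cron job could be as dumb as this sketch: walk every bucket directory, read the parsed JSON, and hand it to whatever does the database insert. (The layout, the .json suffix, and the updateStats stub are assumptions; it also ignores remembering which files have already been processed.)

    // Postprocessing pass over the sharded directories.
    var fs = require('fs');
    var path = require('path');

    var LOG_ROOT = '/var/data/logsoflag';   // same made-up storage root as above

    function updateStats(parsedLog) {
        // stub: the database insert goes here
    }

    fs.readdirSync(LOG_ROOT).forEach(function (bucket) {
        var dir = path.join(LOG_ROOT, bucket);
        if (!fs.statSync(dir).isDirectory()) return;
        fs.readdirSync(dir).forEach(function (name) {
            if (!/\.json$/.test(name)) return;   // only the parsed JSON, not the raw log text
            var parsed = JSON.parse(fs.readFileSync(path.join(dir, name), 'utf8'));
            updateStats(parsed);
        });
    });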

What database? I should probably bite the bullet and just use Postgres, but I hate having to manage daemons. I wonder if SQLite is sufficient? It supports concurrent read access but only one writer at a time, and the writer blocks readers “for a few milliseconds”. At 1000 uploads a day, batched up by a cron job at that, I think that constraint isn’t a problem for me. Right now I’m tempted to go for that, just for the fun of doing something new.

Another problem to solve is the log parsing code. Right now logs are parsed in the client browser, in JavaScript, and the client sends both the raw log text file and the parsed data (as JSON) to my server. (The server is a Python CGI which just writes the files to disk.) I’d like to retain that client-only capability, but also start parsing logs on my server. And I don’t really want to maintain a second parser in Python.
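One way to keep a single parser working in both places is the usual export-to-both-environments trick; parseLog here is just a stand-in name for the real parsing function:

    // One parser file, usable from the browser and from Node.
    // "parseLog" is a stand-in for the real parsing function.
    function parseLog(rawText) {
        // ... existing parsing logic, unchanged ...
        return {};
    }

    if (typeof module !== 'undefined' && module.exports) {
        module.exports.parseLog = parseLog;   // Node: require('./parser').parseLog
    } else {
        window.parseLog = parseLog;           // browser: a global, as before
    }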

So maybe I write the server parser / database stuff in Node to reuse the JavaScript parser. Here’s MapBox’s node-sqlite. There’s zero need for me to make these scripts asynchronous, so the Node paradigm is not a great match, but I can certainly make it work.
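A minimal sketch of what that Node script might look like with the sqlite3 module (the schema and values are invented):

    // Write aggregate stats to SQLite via MapBox's node-sqlite (the "sqlite3" npm module).
    var sqlite3 = require('sqlite3');
    var db = new sqlite3.Database('stats.db');

    // SQLite allows only one writer at a time; wait up to a second for the
    // lock rather than erroring out.
    db.configure('busyTimeout', 1000);

    db.serialize(function () {
        db.run('CREATE TABLE IF NOT EXISTS logs (id TEXT PRIMARY KEY, uploaded INTEGER, avg_ping REAL)');
        var stmt = db.prepare('INSERT OR REPLACE INTO logs VALUES (?, ?, ?)');
        stmt.run('abc123', Date.now(), 57.3);   // made-up row
        stmt.finalize();
    });

    db.close();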

(Naively I’d thought Node startup times would be bad, like Java, but that’s not true. From Node v0.10 on Ubuntu, “NODE_PATH=/tmp time node -e 1” is about 27ms, compared to about 15ms for Python. Not enough difference to matter. strace shows Node makes about 230 system calls for an empty script (it’s nondeterministic!), compared to Python’s 883.)