My little Logs of Lag project is doing pretty well, I need to create some sort of more serious datastore for it. It’s a small service, at most 1000 logfiles uploaded a day (each about 20k per file), but over time that adds up. Right now my data store is “put a file with a random name in a directory”. And I have no database or analysis of all the files.
The high-end solution would be to start putting data files in AWS and create a local Postgres database for aggregate statistics. I know how to build this. But that adds two big moving parts, one of which I have to maintain myself. And it’s kind of overkill for the relatively small amount of traffic I have (and am likely to ever have).
So I’m thinking now just to keep storing the files themselves on the filesystem, but spread it out over a bunch of directories so I don’t have a single giant directory. The Linux kernel no longer chokes on lots of files but shell tools are still a PITA. File names are base 64 encoded things so maybe hash by the first two characters, for 4096 directories. That’ll carry me to about 16M log files (4096 files per directory), good enough. The file system only has 60M inodes available anyway. (Unfortunately – is a valid leading character in the filenames. Oops.)
What database? I should probably bite the bullet and just use Postgres, but I hate having to manage daemons. I wonder if SQLite is sufficient? It supports concurrent read access but only one writer at a time, and the writer blocks readers “for a few milliseconds”. I think that constraint isn’t a problem for me. Right now I’m tempted to go for that, just for the fun of doing something new.
(Naively I’d thought Node startup times would be bad, like Java, but that’s not true. From Node v0.10 on Ubuntu: “NODE_PATH=/tmp time node -e 1” is about 27ms, compare Python 15ms. Not enough difference to matter. strace shows Node makes about 230 system calls for an empty script (it’s nondeterministic!), compare Python’s 883.