I’m about to go on vacation. I just finished a bunch of improvements to lolslackbot. Also I suffered my first real outage since it started as a simple cron job six months ago. So I’m reflective on what I need to do to graduate this service into something more serious and reliable. Some projects towards making the system production quality…
Locking and timeouts. The outage was because I tried grafting lockrun into my cron job to ensure only one update runs at a time. But if the update script hangs (say, because the remote API is down) then it holds the lock forever, or at least until I wake up in the morning. I’ve removed the locking for now, running the update twice at once isn’t too bad. But I’m considering adding back the locking along with some sort of global timeout to the scripts, or at least the network requests, so they can’t hang forever. But that’s a hack on top of a hack. “Run this once and only once, reliably” is a hard problem.
Error recovery. I’m a big believer in throwing exceptions when things go wrong, then fixing the code to handle the error. But at some high level I’d rather the script kept going. I’m processing hundreds of game objects a day; it’d be better to skip a game that’s weird and breaks my code than to have the whole system fail until I can fix it.
Logging. I’ve been using Python’s logging program but I’m not logging to a file anywhere; just stderr. That makes it hard to look back at what happened. Or even notice if I skipped a weird game. I might also benefit from some contextual logging, but that may be overkill.
monitoring. I don’t exactly need a pager for 3am, but I would like some better way of knowing if stuff is working than “did cron email me?” and “have any new messages shown up in awhile?”
virtualenv. I’m naughty and don’t use venv. I should.
make code a module. I’m also naughty and don’t have my own code in a proper module. I should. Would be nice to separate the Python scripts from the precious data files, too.
test suite. A lot of my tests run based on some live data I captured a month or two ago and has baked in it things like “there are exactly 110 game objects”. That makes it hard to add new test cases. I think I’m stuck here and need to just suck it up and hand-edit in new test cases to the JSON blobs I use in my mock framework. And then update all the tests. I also had a hilarious problem yesterday where one of my test cases had enshrined the wrong expected output, basically hiding a bug I sort of knew about. That’s the hazard of starting a test suite with “I think this code is working; let’s assume its current output is correct”. Still that test suite is the single best investment I’ve made in the project, so much value.
indices / SQL performance. I haven’t done any work towards creating indices for common queries other than the implicit indices that primary keys and unique constraints give you. So far my whole database is only 20MB, hardly worth worrying about. But at some point table scans are going to stop being fast. I haven’t seen any evidence of a slow query log in sqlite, I suppose I could build my own wrapper in Python.
Distributed transactions. A closely related problem to locking; I have a sort of distributed transaction problem when I’m delivering messages to Discord. I want to mark the message delivered in my database if the message is successfully delivered to Discord. So I write the “message processed” bit after Discord accepts the message. But sometimes Discord throws an error even though the message got delivered anyway, so the message doesn’t get marked processed and later is delivered a second time. It’s sort of a distributed transaction problem I don’t quite know how to solve.
Web view. Big project, I think I’ve graduated to needing a web view for seeing and editing the objects in my database. Particularly for stuff like “add a player to a team”. I really don’t want to roll this all ad-hoc, way too much work. Currently thinking of doing this as a whole new codebase not reusing much (if any) existing code, to try out Flask and SQLAlchemy. With a plugin to generate the editing UI, maybe WTForms or Flask-Admin.
sqlite3 vs postgres. I continue to have the feeling that I should switch to postgres, but I’m now pretty in love with sqlite’s simplicity. The planned web view will be my first time with simultaneous writers, which previously I assumed was when sqlite may be too limiting. OTOH sqlite does have database-level locking, it’s not going to blow up. As long as the locks are only held for “a few milliseconds” I can live with that. Python’s driver has a 5 second timeout for connections by default. The sqlite3 command tool by default just throws an error if the database is locked, but you can specify a timeout in a somewhat awkward way. (If sqlite3 is compiled wrong, locks can last seconds.)
So that’s a big list of projects. I’m not thrilled to do any of them; they’re not features, and they’re a significant investment for a project that has very few users. OTOH they will make my life better as I work on this.