Putting lolslackbot on the back burner

I’ve decided to stop working on lolslackbot, my social project for League of Legends players who use Slack or Discord. I wrote it originally for a few friends, then slowly expanded it to a few hundred users. But I’ve never put the work in to make it a consumer product and now am not motivated to do it.

The main feature missing is any sort of web interface so people could sign up for themselves. I’ve been maintaining it by hand with database update scripts, doing ~30 minutes of one-off work every few weeks instead of one focussed month-long engineering project. This blog is full of bold plans to port the whole thing to Django and get going on a web interface, but I never did it. Too much product work I don’t really know how to do well, designing interactive web UI. Hell, I don’t even have a proper name for the project.

Also some deeper technical problems. The Django port seems doable but requires database schema changes, specifically in how many-to-many relations work. And I got part of my core schema wrong, an assumption that an individual only belongs to one group. Fixing that would require redoing pretty much all the tests and half the business logic. Also at some point I’d have to migrate from sqlite to Postgres and that doesn’t sound like fun at all. In retrospect it’s too bad I didn’t start with Postgres+Django, but that seemed complicated at the beginning when I was thinking of this as just a cron job.

My real reason for lack of enthusiasm is the market. I like games and I like the idea of making game playing more social. But League of Legends is a hard community to build humble tools for. Most of the energy there goes to highly polished and well marketed sites like LolKing and I’m just not that ambitious. There’s not much money in it (Riot’s API requires you don’t charge for services) and not a lot of love either. Me and my gaming buddies are on a bit of a LoL break too, which makes it harder to stay personally motivated. I’m also bummed that Riot hasn’t done anything more with Clubs, their social feature; my hope was to springboard off of that to build out the bot.

I did get some data from the last user population of Learning Fives, a cohort of ~80 people playing games together for a few weeks. 50% said they found it useful, 20% said it wasn’t, and 30% didn’t know what it was (despite seeing it in their channel). Not sure what conclusion to draw from that.

Anyway it’s a weight off my mind to just say I’m not going to do further work on this, at least for now. Truthfully my mind is on political work right now, I’d really like to do some sort of progressive activism combining data processing and GIS. (I’m following Mike’s work on redistricting closely.) To the extent I do anything for games it’s about time to revisit Logs of Lag, which 2.5 years later is still running just fine and uniquely useful. But I have some bugs to fix and maybe some improvements to make.

Moving forward with lolslackbot and Django

I’m very encouraged by how my little Django experiment has worked out. I’m ready to start using it with lolslackbot. A big breakthrough was realizing that I can start using Django as soon as I trust it not to corrupt my database. Keep the existing Python cron job as old code, not using Django, and then have a separate webapp that also uses the same database. The cron job and the webapp will be reading the same tables but in general not writing the same tables. It seems like a good transition plan.

Only now I really have to face the sqlite question. I’m confident that ultimately I need to be using PostgreSQL. I’m envisioning thousands of web users updating the database; there’s no way that works with sqlite’s database-level locking. Even with low traffic the cron job and the webapp will be stepping on each other. I need to switch. But when?

Do I switch databases first, porting my cron job code over to a Postgres-backed system? That seems really painful and not much fun. Lots of backend work with no visible features.

Or do I use Django first, having it talk to the sqlite database? The lock contention won’t be a real problem if I’m the only webapp user. It lets me build fun / useful features sooner. And it gives me some more experience with data modelling. I’m pretty sure I’m going to want to refactor the schema along with the Postgres update, it’d be better to do that after doing some of the webapp work. The risk with Django-first is I do a lot of sqlite-specific work that is ultimately wasted. But Django’s ORM insulates you from the underlying database pretty well, so maybe that doesn’t matter?

Also curious how the Django ORM is going to work for me. I use a couple of non-standard SQL things now. Mostly “insert or replace”, which is sqlite’s upsert-like extension. Does the ORM expose those? Also very curious how testing will work. sqlite is so simple when managing test environments. But I know the Django folks have thought this through.
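For reference, here is a minimal sketch of what “insert or replace” does, using Python’s built-in sqlite3 module. The people table, its columns, and the upsert_person helper are made up for illustration; they are not the real lolslackbot schema.

```python
import sqlite3

# In-memory database with a hypothetical table, just for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")


def upsert_person(conn, person_id, name, region):
    """Insert the row, or replace the existing row with the same primary key."""
    conn.execute(
        "INSERT OR REPLACE INTO people (id, name, region) VALUES (?, ?, ?)",
        (person_id, name, region),
    )
    conn.commit()


upsert_person(conn, 1, "Alice", "NA")
upsert_person(conn, 1, "Alicia", "LAN")  # same id: replaces the first row

rows = conn.execute("SELECT id, name, region FROM people").fetchall()
```

One thing to keep in mind: REPLACE deletes the conflicting row and inserts a new one, so any column left out of the statement falls back to its default instead of keeping its old value.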

Django learnings: models

I’m soldiering on with trying to apply Django to my lolslackbot project. I thought I’d take a stab at letting Django try to use my existing database. My specific goal is to set up a read-only set of views on the primary social tables I have. People, Groups, Destinations all name entities in my system; I also have GroupMembership and Subscriptions tables which are many-to-many relations between the three primary tables. So far so good, at least for the primary tables; still working on the relations.

Turns out starting with a legacy database isn’t too hard; you can use the inspectdb command to build skeletons of Django model classes from an existing database. Then hand-edit the resulting code. As a bonus this is an excuse for me to start learning more about Django models. Some random things I learned:

  • Django model classes define both a database schema and the UI validation behavior in HTML forms.
  • Every Django model class must have a primary key. By default Django adds an auto-incrementing field named “id”; you can instead mark one of your own fields with primary_key=True.
  • A field can have the option “null” set to true or false; this is whether empty values are stored as nulls in the database. (Why would you ever not?!) There’s also a “blank” option which has nothing to do with the database, but is whether the field is optional in autogenerated forms like the admin interface.
  • You have to hand-add each model class to admin.py to get it to show up in the admin interface.
  • Django coding style is lowercase_with_underscores for field names, CamelCase for class names. ¿Porque no los dos?

That’s the easy stuff. On to hard stuff.

I’m not sure how to translate my existing many-to-many relations tables to however Django implements relationships. I thought maybe the extra fields support (using the through keyword argument) might do it, but that seems like overkill and possibly awkward. I think I need to just adapt the ManyToManyField to my existing schema.

I’m a little scared to let Django start writing data to my database.

inspectdb sets up things with managed = False by default. That seems wise; it prevents Django’s schema management stuff from messing with tables someone else defined already. But some day I’m going to want Django to take all this over for me, can I later change managed to True and make sense out of it?

I haven’t even begun to think about testing in the Django world. I know there’s a lot of support for tests, maybe that’s my next learning.

lolslackbot postmortem

Had a significant outage for my lolslackbot project yesterday. A few different things went wrong and I’m still confused about what the problem was.

The behavior

The problem manifested as me seeing the same message being delivered every time the program runs, every 3 minutes. That’s bad; I’m spamming my users. At the same time I was seeing errors in my logs from trying to deliver messages via Slack. No useful message mind you, but at least a hint.

I was busy last night when I spotted the error so I just shut the whole system down until morning. Then in the morning I tried a quick fix and ran the script but that went badly, so I had to look closer. I finally got it fixed after two hours of work.

The delivery bug

This morning first thing I did was add more logging and reproduce the problem. I discovered the error was that one of the Slack channel IDs no longer existed, which caused an exception in the Slack messaging module, which then broke things. The underlying problem was a design flaw in my error handling; I was trying to deliver all Slack messages at once and only then updating the database indicating those messages had been processed. The result is if there were 3 messages to be delivered at once and the 2nd one caused an error, the 1st one would get delivered but not marked processed and so would get delivered again.

So I fixed it by refactoring the logic that marks messages processed. I still deliver all the Slack messages at once but now individually flag whether each one worked or not. I also mark a message processed whether there was an error in delivery or not. The underlying problem is basically a distributed transaction. I’d rather err on the side of occasionally losing a message than sending the same message many times.
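A rough sketch of that “flag each message individually, mark it processed either way” logic. This is not the actual lolslackbot code: the messages table, its processed and delivered columns, and the send callable are all assumptions made for the example.

```python
import logging
import sqlite3

log = logging.getLogger("delivery_sketch")  # hypothetical logger name


def deliver_pending(conn, send):
    """Try each unprocessed message once; return how many deliveries failed.

    `send(channel, body)` stands in for whatever actually talks to Slack.
    """
    pending = conn.execute(
        "SELECT id, channel, body FROM messages WHERE processed = 0 ORDER BY id"
    ).fetchall()
    failures = 0
    for message_id, channel, body in pending:
        delivered = 1
        try:
            send(channel, body)
        except Exception:
            # One bad channel must not stop the rest of the batch.
            delivered = 0
            failures += 1
            log.error("Could not deliver message %s to %s",
                      message_id, channel, exc_info=True)
        # Mark processed whether or not delivery worked: losing a message
        # occasionally beats re-sending the same one every three minutes.
        conn.execute(
            "UPDATE messages SET processed = 1, delivered = ? WHERE id = ?",
            (delivered, message_id),
        )
        conn.commit()
    return failures
```

Committing inside the loop means a crash partway through leaves the already-handled messages marked, so the next run starts from where this one stopped.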

Rate limiting problem / match commit semantics

A second problem making all this diagnosis difficult was that my system was downloading match objects but they weren’t ending up in the database. I finally figured out my script that downloads all missing matches was crashing before it finished. And I only was calling commit on the database when the script finished, so all the work was getting lost. Derp. I fixed it to now commit after every single match object is downloaded. Also put in some better error handling.
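To make the commit semantics concrete, here is a small sketch with Python’s sqlite3 module. The matches table, download_missing, and flaky_fetch are hypothetical stand-ins; the point is only that rows committed before a crash are still there when the database is reopened.

```python
import os
import sqlite3
import tempfile


def download_missing(conn, match_ids, fetch):
    """Fetch each match and commit it immediately, one transaction per match."""
    for match_id in match_ids:
        payload = fetch(match_id)  # may raise; earlier matches are already saved
        conn.execute(
            "INSERT OR REPLACE INTO matches (id, payload) VALUES (?, ?)",
            (match_id, payload),
        )
        conn.commit()


def flaky_fetch(match_id):
    """Pretend downloader that blows up on match 3."""
    if match_id == 3:
        raise RuntimeError("simulated crash")
    return '{"match": %d}' % match_id


path = os.path.join(tempfile.mkdtemp(), "matches.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE matches (id INTEGER PRIMARY KEY, payload TEXT)")
conn.commit()

try:
    download_missing(conn, [1, 2, 3, 4], flaky_fetch)
except RuntimeError:
    pass  # the script "crashed" partway through
conn.close()

# Reopen the file as a fresh process would: matches 1 and 2 survived.
check = sqlite3.connect(path)
saved_ids = [row[0] for row in check.execute("SELECT id FROM matches ORDER BY id")]
check.close()
```

With a single commit at the end of the script, the same crash would have left the table empty.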

So what’s causing the errors downloading matches? I’m not really sure, but I think it’s Riot’s rate limiter. I have some very high rate limit that I shouldn’t be getting near, but I’m still getting 429 responses for my meagre stream of requests, being told to wait. And this problem has been going on for days. I had chalked it up to a networking problem with their servers, but it turns out it’s my client waiting politely like it’s been asked to. So why am I being throttled?

I don’t know. The thing that triggers it seems to be a few odd matches that are returning 404 errors indicating the match doesn’t exist. (Even though it should, since I saw a reference to it from another API call.) Perhaps they have extra rate limiting for clients that make repeated requests that generate 404s? Part of the problem here is that I treat a 404 as “no meaningful response, try again later”, so I’ve accumulated 10-15 of those over time. I should clear them out, and change the code to stop trying if it gets the 404.
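One way to write down that policy is a tiny function that decides what to do with each response. This is a sketch, not a description of Riot’s API: the action names are invented, and it assumes a 429 response carries a Retry-After header giving a number of seconds.

```python
def next_action(status, headers):
    """Return (action, wait_seconds) for one match download response."""
    if status == 200:
        return ("store", 0)
    if status == 404:
        # Treat "match not found" as final instead of retrying forever.
        return ("give_up", 0)
    if status == 429:
        # Rate limited: wait as long as the server asks, at least a second.
        try:
            wait = int(headers.get("Retry-After", "1"))
        except ValueError:
            wait = 1
        return ("wait", max(wait, 1))
    # Anything else (500s and so on): leave it for a later run.
    return ("retry_later", 0)
```

Recording the "give_up" result in the database is what stops the same handful of missing matches from being requested on every run.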

Lessons learned

Man, debug logs are a big help. Fortunately the same time I was having this problem I’d just committed new code to write debug logs more usefully to a file. Couldn’t have figured out what was going on without it.

A broad thing I learned here is be smarter about error recovery logic when working with third party services. I think when interacting with Riot or Slack or whatever, I want to do one small bit of remote API work and then immediately commit that work to the database before trying the next remote thing. And handle errors from remote services robustly, continuing even if it fails.

Unfortunately some of my code is now squelching exceptions, logging them and continuing instead of crashing the program. This is necessary to make my code more robust to errors, but is scary. Anyway I found I was having a hard time logging exceptions properly, here’s the way I’ve settled on:

try:
    someFunction(data)
except Exception as e:
    logging.error('Something went wrong %s' % data, exc_info=True)

The key thing here is the “exc_info=True”; this gets Python to include a stack trace. Before I was trying to actually log the exception object e itself, but that only gets you the message, not the stack. My use of % is an anti-pattern, I’m really supposed to use a comma and let logging do the substitution, but for some reason I find that error prone. And the worst thing about errors in a logging function like this is unless you are superhuman you often don’t have test coverage for the exception cases, so this line of code only ever executes in production when something else already went wrong and it’s very confusing.
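For comparison, a sketch of the same pattern using the comma form and logging.exception, which logs at ERROR level and includes the stack trace without needing exc_info=True. The process_all helper and the logger name are made up for the example.

```python
import logging

log = logging.getLogger("lolslackbot_sketch")  # hypothetical logger name


def process_all(items, handler):
    """Run handler on every item; log failures with a stack trace and keep going."""
    failed = []
    for item in items:
        try:
            handler(item)
        except Exception:
            # Arguments after the format string are substituted by logging,
            # and only if the record is actually emitted.
            log.exception("Something went wrong with %r", item)
            failed.append(item)
    return failed
```

Returning the failed items gives a test something to assert on, which is one way to get coverage on the exception path before production finds it.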

Stack traces from running Python programs

My lolslackbot program occasionally hangs forever and I’d like to know what failed when I kill it. Starting with Python 3.3 it’s relatively easy using the faulthandler built-in library. You have to set it up ahead of time (it’s not on by default), but once you do you can send the process a SIGABRT and it will display a slightly spartan stack trace and abort the program.

You can get a stack trace without killing the Python process by registering another signal, like faulthandler.register(signal.SIGUSR1). That signal is not entirely unobtrusive; it interrupts time.sleep() for instance. But the program does seem to keep running after printing the stack trace. All the signals registered by default seem to also kill your program; USR1 won’t. (Unless I’m confused.)

It’s all implemented in C to enable printing useful stack traces even if the Python VM itself is broken. Also you can enable it just by setting an environment variable, before any Python code is run, which I imagine is useful for debugging problems at startup.

There’s also an interesting faulthandler.dump_traceback_later() function which seems to basically be a watchdog. It sets a timeout with a separate thread that results in a stack trace dump and, optionally, your program exiting. It calls _exit() which is the hard exit, which has pluses and minuses.
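Putting those pieces together, a setup function might look like the sketch below. install_stack_dumpers is a made-up name, SIGUSR1 only exists on Unix, and the stream has to be a real file with a file descriptor (faulthandler writes to it directly).

```python
import faulthandler
import signal
import sys


def install_stack_dumpers(stream=sys.stderr, watchdog_seconds=None):
    """Turn on faulthandler's stack dumps for a long-running script."""
    # Dump a traceback on fatal signals such as SIGSEGV or SIGABRT.
    faulthandler.enable(file=stream)
    # Dump all threads' stacks on SIGUSR1 without exiting the program.
    faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
    if watchdog_seconds is not None:
        # Watchdog: if still running after the timeout, dump stacks and hard-exit.
        faulthandler.dump_traceback_later(watchdog_seconds, exit=True, file=stream)


if __name__ == "__main__":
    install_stack_dumpers(watchdog_seconds=600)
    # ... rest of the script; from a shell, `kill -USR1 <pid>` prints the stacks.
```

If the script finishes normally before the watchdog fires, faulthandler.cancel_dump_traceback_later() cancels the pending dump.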

It’s sure a lot easier than the only other way I knew to do this, which was to attach gdb and inspect the interpreter’s state. But I wish it were enabled by default like Java’s old built-in behavior on SIGQUIT (Ctrl-\). Maybe they were afraid it was too radical a backwards-incompatible change.

180 new lolslackbot users

I just set up lolslackbot for Learning Fives Session 2. Some 180 new users. Kind of an exciting day for me, before I only had 30 or so users.

My test infrastructure continues to pay off in spades. All sorts of firsts today. First time running with users in a new region (Latin America). First time I got a game where all 10 people were in the cohort I was tracking. Etc. And AFAICT it all worked as intended. Really no problems running it other than some scutwork and some unintended consequences.

The scut work is I still have no UI for administering users in my system. I spent half of today writing 150 lines of Python + tests to set up the new users. I have to create entries in five tables: People, Groups, Channels are all primary data entities, I also have “GroupMembership” and “Subscriptions” which are basically tables joining IDs from other tables. And all this crap is injected with command line tools. I do have scripts to at least populate them from CSV and JSON datafiles, it’s not raw SQL, but it’s close.

I really need a proper web GUI for editing these data entities, regrettably in Django, and I’m still dragging my feet on taking that leap. I have an idea for a new product I can do in Django though, a quick spike for something related to the existing code.

The unintended consequence is that I neglected to account for how unusual this first run would be after the import. Basically I was catching up, getting info on past games those 180 new users played. It took 15 minutes to download 10 match records for each of my 180 new users. And then that generated 177 messages which it immediately delivered, in a few cases spamming 10+ messages to a Slack channel. Those messages were all correct in some sense, but in retrospect it would have been better to suppress the past and only start delivering new messages instead. Oops.

2016 Python webapps

I want to build a webapp for my lolslackbot project. What do I use?

Django is the clear consensus choice for large, grownup projects. I used it once a couple years back and thought it was fairly good, if not exciting or lovely. But I have in the back of my head the Django ORM doesn’t play nice with homegrown schemas, that it really wants to control the schema itself. Also I have the impression that Django is kind of big and clunky, although maybe that’s unfair.

The cool kids all use Flask. I like the idea of a microframework, something simple. Flask doesn’t inspire a lot of confidence though. The last release is 0.10.1, released nearly three years ago. And while Python 3 is supported its use is discouraged. That may have made sense in 2013 but it’s a backward opinion in 2016 (IMHO). (The author reiterated that opinion as much as 8 months ago.) OTOH my friends who use Flask say not to worry about it, that it works fine with Python 3 and it’s simple.

Only no one just uses Flask. They combine it with Jinja2 templates, and WTForms, and SQLAlchemy for an ORM. Add in logins, sessions, and some CSS frameworks and you’re looking at a lot of software. Is it better to glue together your own custom assemblage of small components or should you just use a full framework like Django?

Full Stack Python has some useful advice.

My friend Brad G. recommended starting with cookiecutter-flask as a set of components + structure and customize from there. Man, it’s been years since I used a Visual Studio wizard! It looks like a good way to get up and running though.

Successfully installed Blinker-1.4 Flask-0.10.1 Flask-Assets-0.11 Flask-Bcrypt-0.7.1 Flask-Cache-0.13.1 Flask-DebugToolbar-0.10.0 Flask-Login-0.3.2 Flask-Migrate-1.8.0 Flask-SQLAlchemy-2.1 Flask-Script-2.0.5 Flask-WTF-0.12 Mako-1.0.4 SQLAlchemy-1.0.12 WTForms-2.1 WebOb-1.6.0 WebTest-2.0.20 Werkzeug-0.11.4 alembic-0.8.6 bcrypt-2.0.0 beautifulsoup4-4.4.1 cffi-1.5.2 cssmin-0.2.0 factory-boy-2.6.1 fake-factory-0.5.7 flake8-2.5.4 flake8-blind-except-0.1.0 flake8-debugger-1.4.0 flake8-docstrings-0.2.5 flake8-isort-1.2 flake8-quotes-0.2.4 gunicorn-19.4.5 isort-4.2.2 itsdangerous-0.24 jsmin-2.2.1 mccabe-0.4.0 pep257-0.7.0 pep8-1.7.0 pep8-naming-0.3.3 psycopg2-2.6.1 py-1.4.31 pycparser-2.14 pyflakes-1.0.0 pytest-2.9.0 python-editor-1.0 setuptools-20.2.2 testfixtures-4.9.1 waitress-0.8.10 webassets-0.11.1 wheel-0.29.0

Thinking for real about a webapp is making me once again consider whether SQLite is a good choice going forward. The lack of fine grained locks is going to make multiple writers unacceptable. I sure wish it had per-table locking, that would probably be good enough for now.

Python packaging in March 2016

Hooray! I finally finished porting my lolslacktest code over to be a proper Python module, installable with pip, and with a virtual environment. What a pain! Some notes on what I learned. But first, some reflection.

This project has been one of the least fun things I’ve done in a long time. I can tell by the way I’ve been foot dragging. It’s confusing and hard to debug packages. The docs are inconsistent and suffer from years of accretion. Some things are genuinely confusing, like Python’s surprise lack of support for circular imports. And the end result is my product works just like it did before. No user visible changes. Behind the scenes things are better; the install is cleaner, I have a virtual environment for proper external dependencies, etc. But nothing fun, just slightly less crappy ops.

The state of the art for Python Packaging

As of March 2016 the Python 3 state of the art for package installation is pip, pyvenv, and setuptools. pyvenv and pip come with Python 3.4 but setuptools is still extra. (On Ubuntu you have to install python3.4-venv with apt.)

The state of the art for managing Python packages keeps changing. This Stack answer explains the landscape as of September 2014. This packaging user guide is updated as of September 2015 and is mostly very good, but even it references things that don’t work such as “python setup.py bdist_wheel”. I also found this guide useful but it was mostly written in 2012 (although occasionally updated).

The pip installed by python 3.4 (Ubuntu) is version 1.5.4, you want version 8.0 or later if you want to do things like install gevent with a precompiled wheel file. setuptools is old too. So first thing you should do after setting up a pyvenv is “pip install -U pip setuptools”.

The setuptools docs are not very good. They assume you know how distutils works. Also they spend a lot of time talking about easy_install and none at all talking about pip.

Honestly all this Python packaging stuff is a big mess, I feel like you have to understand the history to be able to use the current state of the art. It’s a shame Distutils2 didn’t work out for Python 3.3 and rationalize it all. To be fair, packaging in almost every system sucks: npm, rvm, homebrew, they’re all a mess. All but my beloved apt, and that’s because the Ubuntu package management team works super hard.

Some hacks and code things

I had to do a few hacks and other things to make my project work.

The biggest code change was just wholesale replacing absolute imports like “import lib” with relative imports “from . import lib”. I think that’s actually correct, but I was doing it kind of blind for every single file.

I had to modify my code because Python really doesn’t support circular import dependencies. You don’t really notice until you are using relative imports. The accepted solution for circular imports is to break the circularity, refactor code. Screw that. My workaround was to move some of the import statements inside functions, so they execute at run time and not import time. That’s bad and inefficient but expedient.
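Here is a self-contained sketch of that workaround. It writes a throwaway package (circular_demo, with invented teams and players modules) to a temp directory; teams imports players at the top of the file, while players only imports teams inside the function that needs it.

```python
import importlib
import os
import sys
import tempfile
import textwrap

TEAMS = textwrap.dedent('''
    from .players import describe_player  # top-level import: runs at import time

    def team_name():
        return "Learning Fives"

    def roster():
        return [describe_player("alice")]
''')

PLAYERS = textwrap.dedent('''
    def describe_player(name):
        # Deferred import: runs when called, after both modules have loaded.
        from .teams import team_name
        return "%s (%s)" % (name, team_name())
''')


def build_demo_package():
    """Write the two-module package to disk and import circular_demo.teams."""
    root = tempfile.mkdtemp()
    pkg = os.path.join(root, "circular_demo")
    os.mkdir(pkg)
    sources = (("__init__.py", ""), ("teams.py", TEAMS), ("players.py", PLAYERS))
    for filename, source in sources:
        with open(os.path.join(pkg, filename), "w") as f:
            f.write(source)
    sys.path.insert(0, root)
    importlib.invalidate_caches()
    return importlib.import_module("circular_demo.teams")


teams = build_demo_package()
roster = teams.roster()
```

If players.py did the same import at the top of the file, importing teams first would fail, because team_name would not be defined yet when players asked for it. After the first call the deferred import is mostly a lookup in sys.modules, so the run-time cost is small.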

There are two similar-looking ways for a Python package to specify dependencies. You can add an install_requires stanza to your setuptools setup() call, or you can use pip to install a bunch of stuff listed in a requirements.txt file. The setup() dependencies are installed automatically when a package is installed, but nothing installs requirements.txt automatically. OTOH the requirements.txt option is more powerful, for instance pip can install things from github URLs whereas setuptools can’t. (The setuptools docs say you can do this with dependency_links, but I couldn’t make it work.) I’ve ended up using a mix of both, preferring setuptools where I can.

I have some shell scripts named in the “scripts” section of my setup.py, so they are installed in the virtualenv bin directory. But I want them to work even if the virtualenv isn’t activated, so those scripts have to source their own virtualenv. Mostly for execution in cron; getting cron to activate a virtualenv is not easy. The hack I did for this was this shim of code at the top of each script:

# Directory containing this script, which is also the virtualenv's bin directory
VENVDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
source "$VENVDIR/activate"

The bash magic in the first line sets VENVDIR to be the directory where the bash script itself is. Conveniently, that’s the same directory that has the activate script.

I have no idea how to put version numbers on my program. It’s a private thing only I’m installing, so for now I’m going with the dbschema version. Part of me wants to just put the git commit nonce there.

My deploy script used to be rsync from dev to prod. Now it’s sshing into prod and having it do a “git pull” followed by “pip install -U”. I took this approach at migurski’s suggestion. It means I’m not using any of the fancy distribution builds and versioning stuff that setuptools/pip enables. But I don’t really need those right now, they make more sense for public code hosted at PyPI.

Note to self: you can’t move a pyvenv environment once it’s created. They have paths hard coded.

Python modules frustration

I’m trying to turn my lolslackbot code into a proper set of Python modules, so I can manage the code with pip and venv. That means turning my moduleless code into a bunch of packaged modules. And writing a setup.py for setuptools.

And oh god it’s awful. Python’s handling of module namespaces and imports is confusing, particularly when circular dependencies are involved and/or command line scripts as opposed to Python libraries. I can’t even tell if this caveat applies to me or not:

Note that relative imports are based on the name of the current module. Since the name of the main module is always “__main__”, modules intended for use as the main module of a Python application must always use absolute imports.

The other problem is years of cruft that’s accumulated in Python packaging tools. setuptools, distutils, easy_install, pip, wheel, egg, virtualenv, pyvenv... To some extent this has been rationalized in modern Python: use pip and setuptools and call it a day. But the docs that are out there are hard to follow because they tend to describe the history, not the current practice.

The two best documents I’ve found so far:

Right now I’m hung up on some code where “from . import lib” works in some contexts but not others. I think it may be related to circular dependencies but am not sure.

Update: turns out Python doesn’t actually support circular imports, despite looking like it does in many circumstances. So now I’m having to refactor all my code that worked just fine to not have module A import module B which imports module A. I feel like I’m programming in C again, only without #ifndef macros. One partial solution is to only do the imports inside the running function code. That avoids the circular imports at import time, at the expense of deferring the import code machinery running in your actual execution path all the time. Maybe I’m misunderstanding how to really make this work. (There’s a bunch of arrogant advice which says “if your code has circular imports you should refactor it.” No, imports are how you manage namespaces in Python, and Python shouldn’t have this limitation.)

Turning hacky Python into a real app

I’m about to go on vacation. I just finished a bunch of improvements to lolslackbot. Also I suffered my first real outage since it started as a simple cron job six months ago. So I’m reflective on what I need to do to graduate this service into something more serious and reliable. Some projects towards making the system production quality…

Locking and timeouts. The outage was because I tried grafting lockrun into my cron job to ensure only one update runs at a time. But if the update script hangs (say, because the remote API is down) then it holds the lock forever, or at least until I wake up in the morning. I’ve removed the locking for now, running the update twice at once isn’t too bad. But I’m considering adding back the locking along with some sort of global timeout to the scripts, or at least the network requests, so they can’t hang forever. But that’s a hack on top of a hack. “Run this once and only once, reliably” is a hard problem.
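A sketch of one way to combine the lock with a global timeout, using only the standard library on Unix. The single_instance name and the lock path are invented. It relies on two things: the kernel drops an flock when the process exits, and an unhandled SIGALRM terminates the process.

```python
import fcntl
import os
import signal
from contextlib import contextmanager


class AlreadyRunning(Exception):
    """Another process (or caller) currently holds the lock."""


@contextmanager
def single_instance(lock_path, max_seconds):
    """Hold an exclusive lock on lock_path, for at most max_seconds."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking: fail immediately instead of queueing up behind a hung run.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise AlreadyRunning(lock_path) from None
        # Global timeout: with no SIGALRM handler installed, the alarm kills
        # the process, and the lock goes away with it.
        signal.alarm(max_seconds)
        try:
            yield
        finally:
            signal.alarm(0)  # finished in time; cancel the alarm
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

A cron script would wrap its main function in `with single_instance("/tmp/update.lock", 170):` (path and limit hypothetical) and treat AlreadyRunning as “skip this run”. Dying from the alarm is abrupt, so this only suits work that is committed in small steps.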

Error recovery. I’m a big believer in throwing exceptions when things go wrong, then fixing the code to handle the error. But at some high level I’d rather the script kept going. I’m processing hundreds of game objects a day; it’d be better to skip a game that’s weird and breaks my code than to have the whole system fail until I can fix it.

Logging. I’ve been using Python’s logging module but I’m not logging to a file anywhere; just stderr. That makes it hard to look back at what happened. Or even notice if I skipped a weird game. I might also benefit from some contextual logging, but that may be overkill.
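A minimal sketch of logging to a file while keeping warnings on stderr (so cron still emails about real problems). The configure_logging name, the format string, and the level choices are just one reasonable setup, not anything the project is known to use.

```python
import logging


def configure_logging(log_path, file_level=logging.DEBUG):
    """Send everything to a log file and only warnings or worse to stderr."""
    root = logging.getLogger()
    root.setLevel(file_level)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    file_handler = logging.FileHandler(log_path)
    file_handler.setLevel(file_level)
    file_handler.setFormatter(formatter)

    console = logging.StreamHandler()  # writes to stderr by default
    console.setLevel(logging.WARNING)
    console.setFormatter(formatter)

    root.addHandler(file_handler)
    root.addHandler(console)
    return [file_handler, console]
```

Call it once at the top of the script; every module that uses logging.getLogger(__name__) then ends up in the same file.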

Monitoring. I don’t exactly need a pager for 3am, but I would like some better way of knowing if stuff is working than “did cron email me?” and “have any new messages shown up in awhile?”

virtualenv. I’m naughty and don’t use venv. I should.

Make code a module. I’m also naughty and don’t have my own code in a proper module. I should. Would be nice to separate the Python scripts from the precious data files, too.

Test suite. A lot of my tests run against some live data I captured a month or two ago, which has baked into it things like “there are exactly 110 game objects”. That makes it hard to add new test cases. I think I’m stuck here and need to just suck it up and hand-edit in new test cases to the JSON blobs I use in my mock framework. And then update all the tests. I also had a hilarious problem yesterday where one of my test cases had enshrined the wrong expected output, basically hiding a bug I sort of knew about. That’s the hazard of starting a test suite with “I think this code is working; let’s assume its current output is correct”. Still that test suite is the single best investment I’ve made in the project, so much value.

Indices / SQL performance. I haven’t done any work towards creating indices for common queries other than the implicit indices that primary keys and unique constraints give you. So far my whole database is only 20MB, hardly worth worrying about. But at some point table scans are going to stop being fast. I haven’t seen any evidence of a slow query log in sqlite, I suppose I could build my own wrapper in Python.
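A homemade slow query log could be as small as the sketch below. timed_query, the logger name, and the 100 ms threshold are all assumptions; note the timing includes fetching the rows, not just running the statement.

```python
import logging
import sqlite3
import time

log = logging.getLogger("slow_query_sketch")  # hypothetical logger name


def timed_query(conn, sql, params=(), slow_seconds=0.1):
    """Run a query, return all rows, and log a warning if it was slow."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - start
    if elapsed >= slow_seconds:
        log.warning("slow query (%.3fs): %s", elapsed, sql)
    return rows
```

Once a query shows up in that log, prefixing it with EXPLAIN QUERY PLAN in the sqlite3 shell shows whether it is scanning a whole table or using an index.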

Distributed transactions. A closely related problem to locking; I have a sort of distributed transaction problem when I’m delivering messages to Discord. I want to mark the message delivered in my database if the message is successfully delivered to Discord. So I write the “message processed” bit after Discord accepts the message. But sometimes Discord throws an error even though the message got delivered anyway, so the message doesn’t get marked processed and later is delivered a second time. It’s sort of a distributed transaction problem I don’t quite know how to solve.

Web view. Big project, I think I’ve graduated to needing a web view for seeing and editing the objects in my database. Particularly for stuff like “add a player to a team”. I really don’t want to roll this all ad-hoc, way too much work. Currently thinking of doing this as a whole new codebase not reusing much (if any) existing code, to try out Flask and SQLAlchemy. With a plugin to generate the editing UI, maybe WTForms or Flask-Admin.

sqlite3 vs postgres. I continue to have the feeling that I should switch to postgres, but I’m now pretty in love with sqlite’s simplicity. The planned web view will be my first time with simultaneous writers, which previously I assumed was when sqlite may be too limiting. OTOH sqlite does have database-level locking, it’s not going to blow up. As long as the locks are only held for “a few milliseconds” I can live with that. Python’s driver has a 5 second timeout for connections by default. The sqlite3 command tool by default just throws an error if the database is locked, but you can specify a timeout in a somewhat awkward way. (If sqlite3 is compiled wrong, locks can last seconds.)
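To see what the lock timeout looks like from Python, here is a small experiment with two connections to the same file. The notes table and try_write helper are invented; timeout is the real sqlite3.connect parameter (in seconds, 5.0 by default).

```python
import os
import sqlite3
import tempfile


def try_write(path, timeout_seconds):
    """Attempt one insert; return False if the database stayed locked too long."""
    conn = sqlite3.connect(path, timeout=timeout_seconds)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO notes (body) VALUES ('hello')")
        return True
    except sqlite3.OperationalError as e:
        if "locked" in str(e):
            return False
        raise
    finally:
        conn.close()


path = os.path.join(tempfile.mkdtemp(), "demo.db")

# isolation_level=None lets this connection manage its transaction by hand.
holder = sqlite3.connect(path, isolation_level=None)
holder.execute("CREATE TABLE notes (body TEXT)")
holder.execute("BEGIN IMMEDIATE")  # take the database-wide write lock

blocked_attempt = try_write(path, 0.1)  # gives up after about 0.1 seconds

holder.execute("COMMIT")  # release the lock
free_attempt = try_write(path, 0.1)
holder.close()
```

As long as writers hold the lock only briefly, the second writer just waits inside the timeout and nobody sees an error.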

So that’s a big list of projects. I’m not thrilled to do any of them; they’re not features, and they’re a significant investment for a project that has very few users. OTOH they will make my life better as I work on this.