lolslackbot postmortem

Had a significant outage for my lolslackbot project yesterday. A few different things went wrong and I’m still not entirely sure what the underlying problem is.

The behavior

The problem manifested as the same message being delivered every time the program ran, every 3 minutes. That’s bad; I’m spamming my users. At the same time I was seeing errors in my logs from trying to deliver messages via Slack. No useful error message, mind you, but at least a hint.

I was busy last night when I spotted the error, so I just shut the whole system down until morning. Then in the morning I tried a quick fix and ran the script, but that went badly, so I had to look closer. I finally got it fixed after two hours of work.

The delivery bug

The first thing I did this morning was add more logging and reproduce the problem. I discovered the error was that one of the Slack channel IDs no longer existed, which caused an exception in the Slack messaging module, which then broke things. The underlying problem was a design flaw in my error handling: I was delivering all Slack messages at once and only then updating the database to indicate those messages had been processed. The result is that if there were 3 messages to be delivered at once and the 2nd one caused an error, the 1st one would get delivered but never marked processed, and so would get delivered again on the next run.
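In other words, the broken flow looked roughly like this (a sketch; pending_messages, deliver_to_slack, and mark_all_processed are stand-ins for my actual names):

for message in pending_messages:
    deliver_to_slack(message)         # an exception here aborts the whole loop...
mark_all_processed(pending_messages)  # ...so this line never runs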

So I fixed it by refactoring the logic that marks messages processed. I still deliver all the Slack messages in one pass, but now I individually flag whether each one worked or not, and I mark a message processed whether or not there was an error in delivery. The underlying problem is basically a distributed transaction. I’d rather err on the side of occasionally losing a message than sending the same message many times.
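The fixed loop looks something like this (again a sketch with hypothetical names):

import logging

for message in pending_messages:
    try:
        deliver_to_slack(message)
    except Exception:
        # Delivery failed; log it and move on rather than crash.
        logging.error('Failed to deliver %s' % message, exc_info=True)
    # Mark the message processed either way, so one bad message
    # can't cause already-delivered messages to be re-sent.
    mark_processed(message)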

Rate limiting problem / match commit semantics

A second problem making all this diagnosis difficult was that my system was downloading match objects but they weren’t ending up in the database. I finally figured out that my script that downloads all missing matches was crashing before it finished. And I was only calling commit on the database when the script finished, so all the work was getting lost. Derp. I fixed it to commit after every single match object is downloaded, and put in some better error handling.
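The fix amounts to moving the commit (and the error handling) inside the loop. A sketch, with download_match, store_match, and db standing in for my real code:

import logging

for match_id in missing_match_ids:
    try:
        match = download_match(match_id)
    except Exception:
        # Log and skip this match; it stays "missing" and gets
        # retried on the next run instead of killing the script.
        logging.error('Failed to download match %s' % match_id, exc_info=True)
        continue
    store_match(match)
    # Commit immediately so a crash later in the loop can't
    # throw away matches we already downloaded.
    db.commit()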

So what’s causing the errors downloading matches? I’m not really sure, but I think it’s Riot’s rate limiter. I have a very high rate limit that I shouldn’t be getting anywhere near, but I’m still getting 429 responses for my meagre stream of requests, being told to wait. And this problem has been going on for days. I had chalked it up to a networking problem with their servers, but it turns out it’s my client waiting politely like it’s been asked to. So why am I being throttled?

I don’t know. The trigger seems to be a few odd matches that return 404 errors indicating the match doesn’t exist (even though it should, since I saw a reference to it from another API call). Perhaps they have extra rate limiting for clients that make repeated requests generating 404s? Part of the problem here is that I treat a 404 as “no meaningful response, try again later”, so I’ve accumulated 10-15 of those over time. I should clear them out, and change the code to stop retrying once it gets the 404.
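Something like this is what I have in mind; the status-code handling follows the requests library convention, and fetch_match and mark_match_missing are hypothetical helpers:

import time

response = fetch_match(match_id)
if response.status_code == 404:
    # The API says this match doesn't exist. Record that fact so we
    # stop re-requesting it forever (and possibly tripping some
    # extra rate limiting with repeated 404s).
    mark_match_missing(match_id)
elif response.status_code == 429:
    # Rate limited: honor the Retry-After header before retrying.
    time.sleep(int(response.headers.get('Retry-After', 1)))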

Lessons learned

Man, debug logs are a big help. Fortunately, around the same time I was having this problem, I’d just committed new code to write debug logs more usefully to a file. I couldn’t have figured out what was going on without it.
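The standard library makes this pretty painless; a minimal setup along these lines (the filename is made up):

import logging

logging.basicConfig(
    filename='lolslackbot.log',
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s')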

A broad thing I learned here is to be smarter about error recovery logic when working with third party services. When interacting with Riot or Slack or whatever, I want to do one small bit of remote API work and then immediately commit that work to the database before trying the next remote thing. And I want to handle errors from remote services robustly, continuing even if one call fails.

Unfortunately some of my code is now squelching exceptions, logging them and continuing instead of crashing the program. This is necessary to make my code more robust to errors, but it’s scary. Anyway, I found I was having a hard time logging exceptions properly; here’s the way I’ve settled on:

import logging

try:
    someFunction(data)
except Exception as e:
    # exc_info=True makes logging include the full stack trace
    logging.error('Something went wrong %s' % data, exc_info=True)

The key thing here is the “exc_info=True”; this gets Python to include a stack trace. Before, I was trying to log the exception object e itself, but that only gets you the message, not the stack. My use of % is an anti-pattern; I’m really supposed to pass the arguments separately and let logging do the substitution, but for some reason I find that error prone. And the worst thing about errors in a logging call like this is that, unless you are superhuman, you often don’t have test coverage for the exception cases, so this line of code only ever executes in production when something else has already gone wrong, and that’s very confusing.
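For completeness, the form logging wants is the lazy one, where the substitution only happens if the record is actually emitted:

logging.error('Something went wrong %s', data, exc_info=True)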

One thought on “lolslackbot postmortem”

  1. There’s also a logging.exception() that I think is equivalent to the .error call you’re using but with a bit less cruft. else and finally are also handy options in a try block that I often find myself forgetting about until cleaning up some needlessly convoluted error-handling mess I’d made.
