Update: this doesn’t work, see below.
I still need a better solution for running lolslackbot. Right now it’s a shell script that runs 7 Python programs sequentially; the whole thing usually completes in about 25 seconds. I run it from cron once a minute. The problem is that sometimes one program takes too long, typically because a remote API server is hung. It’s also better (but maybe not essential?) if the script never runs twice at once.
So I set up lockrun in the cron job so that only one instance of the job can run at once. This works pretty well, but lockrun has no solution for timing out a dead job. Despite my best efforts, twice now the script hung for hours (forever?) and my system stopped working.
(I can’t even figure out how to monitor lockrun’s lockfile. It creates the file once and then never touches or updates it ever again, so the file time never changes. I could monitor my underlying program’s logfile or something I guess.)
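For reference, here is a minimal Python sketch of the single-instance idea, assuming an flock-style advisory lock. It is not lockrun's actual code, and the name try_lock is made up for illustration. Touching the file on each acquisition is an extra of this sketch, so the file time tells a monitor when the current run started.

```python
import fcntl
import os


def try_lock(path):
    """Try to take an exclusive, non-blocking lock on path.

    Returns an open file descriptor holding the lock, or None if
    some other open file already holds it. The lock is released
    when the descriptor is closed or the process exits.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    # Assumption of this sketch: bump the mtime so a monitor can
    # see how long the current holder has been running.
    os.utime(path)
    return fd
```

A cron job would call try_lock("/tmp/lf1") first and exit quietly on None. Because the kernel drops the lock when the process dies, there is no stale lockfile to clean up, but nothing here times out a hung holder either.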
So now I’m wrapping some baling wire around the duct tape and using timeout to add a timeout.
timeout 3 lockrun --lockfile=/tmp/lf1 -- sleep 10
Yeah, I know. But it seems to work. What could possibly go wrong?
I think the right solution is to ensure my scripts are safe to run twice at once. I don’t think this is too hard semantically, and I do have a database with transactions to help mediate things. The trick is the occasional faux distributed transaction for stuff like “if I find a new row in the database, update some remote service and then mark the row as processed”. I guess I could add three phase locking for that? Ugh. While I’m at it I should make it so the 7 Python programs can run independently, at different frequencies. The very first one is the important one, the slow one, asking “is there any activity for any of these N users?” It’d be nice to run that script in parallel and somewhat out of lockstep sync.
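One way the “find a new row, update the remote service, mark it processed” step could be made safe to run twice at once is to claim the row in a transaction before touching the remote service. The sketch below uses sqlite3 only to stay self-contained; the events table, its state and worker columns, and the notify callback are all hypothetical, not lolslackbot's schema.

```python
import sqlite3


def claim_next(conn, worker):
    """Claim one 'new' row for this worker; return its id or None."""
    with conn:  # commits on success, rolls back on exception
        row = conn.execute(
            "SELECT id FROM events WHERE state = 'new' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The state check in WHERE makes this a compare-and-set:
        # if another worker claimed the row first, rowcount is 0.
        cur = conn.execute(
            "UPDATE events SET state = 'claimed', worker = ? "
            "WHERE id = ? AND state = 'new'",
            (worker, row[0]),
        )
        return row[0] if cur.rowcount == 1 else None


def process(conn, worker, notify):
    """Claim a row, call the remote service, then mark the row done."""
    event_id = claim_next(conn, worker)
    if event_id is None:
        return None
    notify(event_id)  # remote call; if it raises, the row stays 'claimed'
    with conn:
        conn.execute(
            "UPDATE events SET state = 'done' WHERE id = ?", (event_id,)
        )
    return event_id
```

This is not a real distributed transaction. A crash between notify and the final update leaves the row stuck in 'claimed', and whether to retry such rows (at-least-once) or leave them for inspection (at-most-once) is a policy decision the sketch doesn't make.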
While I’m here I wanted to give a shout-out to cronic, yet another wrapper for cron jobs that improves error reporting. cron mails you if there’s any stdout or stderr, but ignores exit status. cronic fixes it so it mails you if there’s stderr or if the exit code indicates an error. But it ignores stdout. Seems particularly useful in concert with timeout; if timeout kills a job the only way you know is exit status, which cron without cronic ignores.
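To make that reporting rule concrete, here is a rough Python sketch of the behavior described above: stay silent unless there is stderr output or a nonzero exit code. It is an illustration of the idea, not cronic's implementation, and cronic_like is a made-up name.

```python
import subprocess


def cronic_like(cmd):
    """Run cmd; return "" on quiet success, else a report worth mailing."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0 and not result.stderr:
        return ""  # stdout alone is not treated as an error
    return (
        f"command: {cmd!r}\n"
        f"exit code: {result.returncode}\n"
        f"stderr:\n{result.stderr}\n"
        f"stdout:\n{result.stdout}\n"
    )
```

A cron job that prints the returned report only when it is non-empty gets mail exactly when something went wrong. With GNU timeout, a run killed for taking too long normally shows up here as exit code 124.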
Update: this timeout + lockrun thing doesn’t work. If the timeout fires then timeout kills the child process, lockrun in this instance. But that kill signal doesn’t propagate to the children. A potential solution would be to kill the whole process group but I don’t see an easy way to do that. I’m also wary of adding hacks on hacks because of the warning in the lockrun docs about how there’s no kill option in lockrun because killing things correctly is hard.
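For what it's worth, killing the whole process group is fairly short if the wrapper is written in Python rather than shell. The following is a minimal sketch, assuming a POSIX system; run_with_timeout is a hypothetical helper, not something timeout or lockrun provides. It starts the command in its own session, so the command's pid doubles as its process group id, and kills that group when the deadline passes.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, seconds):
    """Run cmd; if it outlives the deadline, kill it and its children.

    Returns the exit status, which is negative (minus the signal
    number) when the command was killed.
    """
    # start_new_session=True makes the child call setsid(), so it
    # leads a fresh process group whose id equals proc.pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)  # whole group, grandchildren too
        return proc.wait()
```

SIGKILL gives the children no chance to clean up; a gentler version would send SIGTERM first and escalate after a grace period. A grandchild that moves itself into yet another process group would still escape, which is part of why killing things correctly is hard.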
Update 2: after 9 months I think my current solution is stable. I use lockrun to be sure only one instance of the cron job is running. It runs a Python program, not a shell script, that runs all my separate Python subprograms (by invoking the main() function of each). It uses faulthandler.dump_traceback_later to implement a last resort global timeout on the whole Python program. It all feels wrong, but in practice it’s been quite robust.
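A minimal sketch of that arrangement might look like the following. The run_all helper and the way steps are passed in are illustrative, and using exit=True is an assumption about how the watchdog is configured: with it, faulthandler dumps every thread's traceback to stderr and then calls _exit(1) once the timeout expires.

```python
import faulthandler


def run_all(steps, timeout_seconds):
    """Run each step in order under one global watchdog timer."""
    # If we are still running after timeout_seconds, a watchdog
    # thread dumps all tracebacks to stderr and exits with status 1.
    faulthandler.dump_traceback_later(timeout_seconds, exit=True)
    try:
        for step in steps:
            step()
    finally:
        # Finished (or failed) in time: disarm the watchdog.
        faulthandler.cancel_dump_traceback_later()
```

The driver would call something like run_all([mod_a.main, mod_b.main], 55), where the module names are placeholders. Because the exit happens via _exit, no finally blocks or atexit handlers run, which is what makes it a last resort. The traceback does at least show where the program was stuck.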