multiprocessing worker processes dying

Python’s multiprocessing.Pool has a design wrinkle that’s a bit awkward. If you have a pool with N tasks and one of those task subprocesses dies unexpectedly (say, to a SIGTERM or something) then the pool hangs. It looks like N-1 tasks have finished and there’s still one waiting. But it will never complete and your parent process will effectively be stuck. Note that normal termination doesn’t do that, including random exceptions, SIGINT from Ctrl-C, etc. A normal “kill” triggers this though, as certainly does a “kill -9”. Probably a segfault in the Python interpreter will too.

Personally I think this is a bad design choice, but it’s not by accident. There was a huge discussion about this behavior three years ago. I haven’t read it all, but most of the comments seem to be about the wisdom and difficulty of recovering from a bad state. The ticket got closed after someone committed some changes to concurrent.futures (Python 3’s preferred new library). Nothing changed in multiprocessing.Pool.

Recently this issue was revisited for multiprocessing.Pool with a new bug filed that includes a patch. The approach there (and concurrent.futures) is if a child dies unexpectedly, you want to kill the whole Pool immediately with a BrokenProcessPool exception. I’m not wild about this choice, but it’s definitely better than hanging.

None of this applies to the Python distribution we’re running today. The pool will hang. For OpenAddresses I suggest we work around the issue by simply not killing workers. If you want to abort a worker early, try SIGALRM. We could also install a SIGTERM handler to catch the simple “kill” case from an operator, but I’m not sure that’s wise.

In addition, OpenAddresses also has a SIGUSR1 handler that allows someone externally to shut down the whole pool. It’s good for recovering from this state.

Update: we hit this bug again in a new way. Some of the worker processes were getting killed by the Linux OOM Killer. The Python code doesn’t see any exception, it’s just a SIGTERM or something. Only way you know is a record in the syslog. (And the multiprocessing debug logs show a new worker started.)