Pygrunn: Reliable distributed task scheduling - Niels Hageman

Tags: pygrunn, python

(One of the summaries of the 2015 Pygrunn conference)

Note: see Niels Hageman’s somewhat-related talk from 2012. Niels works at Paylogic. Wow, the room was packed.

They ran into the usual problem: operations that take too long for the regular request/response cycle. The normal solution is to use a task queue. Some requirements:

  • Support python, as most of their code is in python.

  • It has to be super-reliable. It also needs to allow running in multiple data centers (for redundancy).

  • Ideally, a low-maintenance solution as they already have enough other work.

Option 1: celery + rabbitMQ. It is widely used and relatively easy to use. But rabbitMQ was unreliable. With alarming frequency, the two queues in the two datacenters lost sync. They also got clogged from time to time.

Option 2: celery + mysql. They already use mysql, which is an advantage. But… the combination was buggy and not production-ready.

Option 3: gearman with mysql. The python bindings were buggy and unmaintained. And you could only run one gearman bundle, so running in multiple datacenters was out of the window.

Option 4: do it yourself. They did this and ended up with “Taskman” (which I couldn’t find online; they’re planning on making it open source later on, but they still need to add installation documentation).

The backend? They started with mysql. It is a great relational database, but it isn’t a great queue. There is a saying on the internet: Thou shalt not use thine database as a task queue. With some adjustments, like autocommit, they got it working nicely anyway.
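The talk didn't show the actual schema or queries, so the following is only my own minimal sketch of the "database as a queue with autocommit" idea. It uses python's built-in sqlite3 instead of mysql so that it runs anywhere. The `tasks` table and the `open_queue`, `enqueue` and `claim_next` names are made up for the illustration.

```python
import sqlite3


def open_queue(path=":memory:"):
    # isolation_level=None puts the sqlite3 module in autocommit mode:
    # every statement is committed right away, so no long-lived
    # transaction holds locks while a worker is busy with a task.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id INTEGER PRIMARY KEY,"
        " function TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT)"
    )
    return conn


def enqueue(conn, function):
    """Add a task (hypothetical: identified by a dotted function name)."""
    cur = conn.execute("INSERT INTO tasks (function) VALUES (?)", (function,))
    return cur.lastrowid


def claim_next(conn, worker):
    """Claim the oldest pending task for `worker`; return its id or None."""
    while True:
        row = conn.execute(
            "SELECT id FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The WHERE clause re-checks the status, so only one worker wins
        # when two of them selected the same row.
        cur = conn.execute(
            "UPDATE tasks SET status = 'running', worker = ? "
            "WHERE id = ? AND status = 'pending'",
            (worker, row[0]),
        )
        if cur.rowcount == 1:
            return row[0]
```

The conditional UPDATE is what keeps two workers from picking up the same task. Whether Taskman does it this way, I don't know.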

The task server consists of a python daemon (running under supervisor) and a separate task runner. The task runner runs in a separate process to provide isolation and resource control.
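To make the "separate process" point a bit more concrete, here is a small sketch of my own, not Taskman's code. It assumes a task can be handed over as a snippet of python code and that a timeout is the only resource limit; `run_isolated` is a hypothetical name.

```python
import subprocess
import sys


def run_isolated(code, timeout=10.0):
    """Run a snippet of python in a child process.

    Returns (exit_code, stdout), or (None, "") when the task overran
    its timeout. A crash or hang in the task happens in the child
    process, so it cannot take the daemon down with it.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None, ""
    return result.returncode, result.stdout
```

Starting a fresh interpreter per task costs a noticeable amount of time, which is fine for tasks that run for minutes but not for thousands of tiny ones.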

Of course, the task server needs to be integrated in the main server. The task server is written as an independent application, so how does the task runner find the python functions it needs to run? They do this via “server plugins” that define which environment variables are needed, which python path you need and which function and which version you need. All this gets applied by the task runner and subsequently it can import and run the function.
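I don't know what such a server plugin looks like in practice, so this is a guess at the general shape. The `ServerPlugin` class and the `resolve` function are hypothetical, and I left out the version handling.

```python
import importlib
import os
import sys
from dataclasses import dataclass, field


@dataclass
class ServerPlugin:
    """Hypothetical description of how to reach one application's code."""

    python_path: list = field(default_factory=list)
    environment: dict = field(default_factory=dict)


def resolve(plugin, dotted_name):
    """Apply the plugin's settings, then import and return the function."""
    os.environ.update(plugin.environment)
    # Insert in reverse so the first entry of python_path ends up first.
    for entry in reversed(plugin.python_path):
        if entry not in sys.path:
            sys.path.insert(0, entry)
    module_name, _, function_name = dotted_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, function_name)
```

Changing `os.environ` and `sys.path` like this is only reasonable because the runner does it inside its own process: the settings of one task don't leak into the next.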

Some additional features of their task runner:

  • Tasks can report progress.

  • Tasks can be aborted.

  • Task start time can be constrained.

  • There’s exception handling.
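The four features above can be sketched in a few lines. This is my own illustration under some assumptions: the task gets a context object to report to, an abort is noticed at the next progress report, and the start-time constraint is a simple "not before" timestamp. `TaskContext`, `Aborted` and `run_task` are made-up names.

```python
import datetime as dt
import traceback


class Aborted(Exception):
    """Raised inside a task when someone asked for it to stop."""


class TaskContext:
    """Hypothetical handle a task receives to talk to the runner."""

    def __init__(self):
        self.progress = 0
        self.abort_requested = False

    def report(self, percent):
        # Reporting progress doubles as the point where an abort lands.
        if self.abort_requested:
            raise Aborted()
        self.progress = percent


def run_task(function, context, not_before=None, now=None):
    """Run `function(context)` and return a (status, detail) tuple."""
    now = now or dt.datetime.now()
    if not_before is not None and now < not_before:
        return "postponed", None
    try:
        return "done", function(context)
    except Aborted:
        return "aborted", None
    except Exception:
        # Keep the traceback so the failure can be inspected later.
        return "failed", traceback.format_exc()
```

A task that never reports progress can't be aborted this way. A real runner that uses a separate process could kill that process instead.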

Some of the properties of taskman: it is optimized for long-running tasks. And: it is designed for reliability. Very necessary, as Paylogic is a payment processor.

It also means it is less suited when you have lots of little tasks. Running everything as a separate process is fine for longer-running processes, but it is too heavy-weight for lots of small tasks. Oh, and there’s no admin UI yet: he uses phpMyAdmin :-)

 

About me

My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.
