Distributed job scheduling - Niels Hageman

Tags: pygrunn, python

Sisyphus, which they made at Paylogic, is a semi-distributed job scheduling system.

Originally they used cronjobs. Cron is very robust and reliable, but it has limitations:

  • The configuration is strictly on one machine only. You cannot have a pool of machines where you choose one of them to execute it. So you’re limited regarding redundancy: a cronjob only runs on one machine.

  • A cronjob really looks at the time. If a task is scheduled for 7:00 and the server is down at that time, the cronjob does not run. Even if the server starts up again at 7:01.

  • You have limited monitoring. Cron doesn’t care if something crashes. You really have to monitor the logging.

So… they wanted something new! Sisyphus. The features:

  • A central configuration, so you can schedule on a pool of machines (any/all).

  • There is a configurable “time to live” for jobs: you can run the 7:00 at 7:02 if you think that’s still “fresh” enough.

  • Job results are saved, so you can monitor it way better.

  • There are configuration options, for instance a maximum execution time (“if it takes more than 5 minutes, something is wrong”). You can also terminate or suspend a job remotely. And retry-on-failure is also an option.

  • There are a couple of interaction possibilities: SQL, commandline client, Python API.
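To make the “time to live” idea a bit more concrete, here is a minimal sketch of such a freshness check. This is my own illustration, not Sisyphus code: the function name `is_still_fresh`, the date and the five-minute value are all assumptions.

```python
from datetime import datetime, timedelta


def is_still_fresh(scheduled_at, now, time_to_live):
    """Return True if a job scheduled for `scheduled_at` may still start at `now`.

    Hypothetical helper: the job is "fresh" from its scheduled time until
    the time to live has passed.
    """
    return scheduled_at <= now <= scheduled_at + time_to_live


# Illustrative values only: a 7:00 job with a five-minute time to live.
scheduled = datetime(2024, 1, 1, 7, 0)
ttl = timedelta(minutes=5)

print(is_still_fresh(scheduled, datetime(2024, 1, 1, 7, 2), ttl))   # True
print(is_still_fresh(scheduled, datetime(2024, 1, 1, 7, 30), ttl))  # False
```

Plain cron behaves as if the time to live were zero: a server that comes back at 7:01 simply skips the 7:00 job.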

The system depends on a central SQL database, so that part is not redundant. But that central database is the most critical part of Paylogic, so if the main database goes down, jobs that don’t run are hardly a priority :-) So this dependency is fine.

Sisyphus is divided into separate parts, like scheduler, dispatcher, configurator, etc.

Nice detail: Sisyphus is actually run as a cronjob on all the active hosts (polling for jobs). If a job is available, a “guardian” claims a process using a database lock. The job is run in a subprocess. The guardian keeps track of the job and takes care of the bookkeeping.
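The talk did not show the actual implementation, so the following is only a rough sketch of how claim-then-run could look. Everything in it is an assumption on my part: the `jobs` table and its columns, the function names `claim_job` and `run_claimed_job`, the use of sqlite3 as a stand-in for the central SQL database, and a conditional `UPDATE` playing the role of the database lock. For simplicity a job here is a snippet of Python code.

```python
import sqlite3
import subprocess
import sys


def claim_job(conn, host):
    """Try to claim one pending job for `host`; return (id, code) or None."""
    with conn:  # commits on success, rolls back on an exception
        row = conn.execute(
            "SELECT id, code FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        job_id, code = row
        # The conditional UPDATE acts as the lock: only one host can flip
        # the row from 'pending' to 'running'.
        cursor = conn.execute(
            "UPDATE jobs SET status = 'running', host = ? "
            "WHERE id = ? AND status = 'pending'",
            (host, job_id),
        )
        if cursor.rowcount != 1:
            return None  # another host claimed it first
        return job_id, code


def run_claimed_job(conn, job_id, code, max_seconds):
    """Run the job in a subprocess and store the result (the bookkeeping)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=max_seconds,  # maximum execution time
        )
        status = "done" if result.returncode == 0 else "failed"
        output = result.stdout
    except subprocess.TimeoutExpired:
        status, output = "timed_out", ""
    with conn:
        conn.execute(
            "UPDATE jobs SET status = ?, output = ? WHERE id = ?",
            (status, output, job_id),
        )
    return status


# Demo with an in-memory database and a single hypothetical job.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, code TEXT, "
    "status TEXT DEFAULT 'pending', host TEXT, output TEXT)"
)
conn.execute("INSERT INTO jobs (code) VALUES (?)", ("print('hello')",))
conn.commit()

claimed = claim_job(conn, "host-a")
if claimed is not None:
    job_id, code = claimed
    print(run_claimed_job(conn, job_id, code, max_seconds=300))
```

Because the results end up in the database, monitoring becomes a matter of querying the `jobs` table instead of digging through logs.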

The current state: they are using it in production at Paylogic. There are still some rough edges that need attention, however. It is not open source yet, but it will be.

