Pygrunn: Kliko, compute container specification - Gijs Molenaar

(One of my summaries of the one-day 2016 PyGrunn conference).

Gijs Molenaar works on processing big data for large radio telescopes (“MeerKAT” in South Africa and “LOFAR” in the Netherlands).

The data volumes coming from such telescopes are huge. 4 terabits per second, for example. So they do a lot of processing and filtering to get that number down. Gijs works on the “imaging and calibration” part of the process.

So: scientific software. Which is hard to install and fragile. Especially for scientists. So they use Ubuntu’s “Launchpad PPAs” to package it all up as Debian packages.

The new hit nowadays is Docker. Containerization. A self-contained light-weight “virtual machine”. Someone called it centralized agony: only one person needs to go through the pain of creating the container and all the rest of the world can use it… :-)

His line of work is often centered around pipelines. Data flows from one step to the other and on to the next. This is often done with bash scripts.

Docker is nice and you can hook up multiple Docker containers. But… it is all network-centric: a web container plus a database container plus a Redis container. It isn’t centered on data flows.

So he built something new: kliko. He’s got a spec for “kliko” containers. Like “read your input from /input”. “Write your output to /output”. There should be a kliko.yml that defines the parameters you can pass. There should be a /kliko script as an entry point.
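To make the idea a bit more concrete, here is a minimal sketch of what the body of such an entry-point script could look like. It is my own illustration, not code from the talk: the `run` function, the `factor` parameter and the "one number per line" text files are all hypothetical. Only the /input and /output convention comes from the spec as described above.

```python
from pathlib import Path


def run(input_dir, output_dir, params):
    """Read every *.txt file in input_dir, scale its numbers, write to output_dir.

    The input files are only read, never modified.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # "factor" is a hypothetical parameter; a real container would declare
    # its parameters in kliko.yml.
    factor = float(params.get("factor", 1.0))
    written = []
    for source in sorted(input_dir.glob("*.txt")):
        numbers = [float(token) for token in source.read_text().split()]
        target = output_dir / source.name
        target.write_text("\n".join(str(n * factor) for n in numbers) + "\n")
        written.append(target.name)
    return written


# Inside the container, the /kliko script would end with something like:
#     run("/input", "/output", params)
# How params reach the script is up to the spec and not shown here.
```

The directories are arguments instead of hard-coded paths so you can try the function out on a temporary directory outside of a container.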

Apart from the kliko container, you also have the “kliko runner”. It is the actor that runs the container. It runs the containers with the right parameters. You can pass the parameters on the command line or via a web interface. Perfect for scientists! You get a form where you can fill in the various parameters (defined in the kliko.yml file) and “just” run the kliko container.
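Roughly, a runner has two jobs: check the parameters against what the container says it accepts and start the container with the right directories mounted. The sketch below is my own guess at that, not the actual kliko runner. The `FIELDS` dict stands in for whatever kliko.yml declares, the function names are made up, and passing the parameters as a JSON argument after the image name is purely an assumption. The `docker run` flags themselves (`--rm`, `-v host:container[:ro]`) are regular Docker CLI options.

```python
import json
import os

# Hypothetical parameter definitions, standing in for the contents of a kliko.yml.
FIELDS = {
    "factor": {"type": float, "default": 1.0},
    "name": {"type": str},  # no default, so this one is required
}


def validate(fields, given):
    """Return a complete, type-converted parameter dict or raise ValueError."""
    unknown = set(given) - set(fields)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    result = {}
    for key, spec in fields.items():
        if key in given:
            result[key] = spec["type"](given[key])
        elif "default" in spec:
            result[key] = spec["default"]
        else:
            raise ValueError(f"missing required parameter: {key}")
    return result


def docker_command(image, input_dir, output_dir, params):
    """Build (but do not execute) a docker call with /input and /output mounted."""
    return [
        "docker", "run", "--rm",
        "-v", f"{os.path.abspath(input_dir)}:/input:ro",  # input is read-only
        "-v", f"{os.path.abspath(output_dir)}:/output",
        image,
        json.dumps(params, sort_keys=True),  # assumed way of passing parameters
    ]
```

A web form would fill in the same `given` dict that the command line does, and the resulting list can be handed to `subprocess.run` when you actually want to start the container.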

An idea: you could use it almost as functional programming: functional containers. Containers that don’t change the data they’re operating on. Every time you run it on the same input data, you get the same results. And you can run them in parallel by definition. And you can do fun things with caching.
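The caching part could work like memoization: if the same container with the same parameters on the same input always gives the same output, a hash of those three things can serve as a cache key. A small sketch of that idea, with hypothetical names (`cache_key`, `run_cached`) and a plain dict as cache. It assumes the image name really pins down one fixed image, which a mutable tag doesn't guarantee.

```python
import hashlib
import json
from pathlib import Path


def cache_key(image, params, input_dir):
    """Hash everything that determines the result: image, parameters, input bytes."""
    digest = hashlib.sha256()
    digest.update(image.encode())
    digest.update(json.dumps(params, sort_keys=True).encode())
    for path in sorted(Path(input_dir).rglob("*")):
        if path.is_file():
            # Include the relative name so that renaming a file changes the key.
            digest.update(path.relative_to(input_dir).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()


def run_cached(image, params, input_dir, cache, run):
    """Call run(...) only if this exact combination has not been seen before."""
    key = cache_key(image, params, input_dir)
    if key not in cache:
        cache[key] = run(image, params, input_dir)
    return cache[key]
```

Here `run` is whatever actually starts the container. Reading every input byte to compute the key is only sensible when hashing is much cheaper than the processing step itself.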

There are some problems with kliko.

There is no streaming yet.
It is filesystem based at the moment, which is slow.

These are known problems, but they are acceptable for what they’re currently using it for. They’ll work on them, though. One thing they’re also looking at is “kliko-compose”, so something that looks like “docker-compose”.

Some (fundamental) problems with docker:

Docker access means root access, basically.
GPU acceleration is crap.
Cached filesystem layers are just annoying. At first it seems fine that all the intermediary steps in your Dockerfile are cached, but it is really irritating once you install, for instance, Debian packages. They’re hard to update.
You can’t combine containers.