Docker-compose autoheal: case sensitivity

Tags: geo, django, nelenschuurmans

I use docker-compose quite a lot. Most of the Python/Django projects we deploy run under docker-compose (one of the two big ones is in Kubernetes already). A while back I moved several “geoservers” to docker-compose. Geoserver is a web mapping server written in Java that runs on Tomcat. It is normally pretty stable, but you can get it to crash or become unresponsive.

So that’s something for which docker’s health check comes in handy. You can configure it in docker-compose itself, but I put it in our geoserver’s custom Dockerfile as I was making some other modifications anyway:

FROM docker.osgeo.org/geoserver:2.23.1

... some unrelated customizations ...

HEALTHCHECK --interval=20s --timeout=10s --retries=3 --start-period=150s \
    CMD curl --fail --max-time 3 http://localhost:8080/geoserver/web/ || exit 1

A simple “curl” command to see if the geoserver still displays its start page. With a generous --start-period as geoserver needs quite some time to start up.
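For readers who think in Python rather than in curl flags, the check amounts to roughly the following. This is only a sketch of the semantics (an answer without an HTTP error status within about three seconds counts as healthy). The function name is_healthy is made up for illustration and is not something the Dockerfile uses, and urllib's timeout and redirect handling differ slightly from curl's.

```python
import http.client
import urllib.request


def is_healthy(url, timeout=3.0):
    """Return True if the URL answers in time without an HTTP error status.

    Roughly mirrors `curl --fail --max-time 3 <url>`: HTTP errors (4xx/5xx),
    refused connections and timeouts all count as unhealthy.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


if __name__ == "__main__":
    # Same URL as in the HEALTHCHECK; only meaningful inside the container.
    print(is_healthy("http://localhost:8080/geoserver/web/"))
```

Docker itself only looks at the exit code of the HEALTHCHECK command, which is why the curl version ends in || exit 1.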

Docker-compose allows for healthchecks, and displays Up (healthy) in the “state” column when you call docker-compose ps. But docker-compose doesn’t actually restart failed services. For that, you need docker-autoheal as an extra service. At the core, it consists of a single shell script that asks docker if there are containers matching the filter health=unhealthy and optionally autoheal=true. If found, they get restarted.
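The selection logic can be sketched in a few lines of Python. This is not autoheal's actual code (that is a shell script talking to the docker socket); the container list below is a hypothetical in-memory stand-in for what docker would report, and the function name is my own.

```python
# Hypothetical stand-in for what docker reports; names and fields are
# illustrative only, not autoheal's real data model.
containers = [
    {"name": "geoserver-1", "health": "unhealthy", "labels": {"autoheal": "true"}},
    {"name": "geoserver-2", "health": "healthy", "labels": {"autoheal": "true"}},
    {"name": "nginx", "health": None, "labels": {}},
]


def containers_to_restart(containers, label="autoheal", value="true"):
    """Return the names of unhealthy containers whose label matches exactly."""
    return [
        c["name"]
        for c in containers
        if c["health"] == "unhealthy" and c["labels"].get(label) == value
    ]


print(containers_to_restart(containers))  # prints ['geoserver-1']
```

Containers without a health check, like nginx here, never show up as unhealthy and are left alone. Note that the label comparison is a plain string match.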

I have a mix of services (geoserver, pgbouncer, nginx) with only the geoservers having a health check. So I configured autoheal like this in my docker-compose.yml:

autoheal:
  image: willfarrell/autoheal:1.1.0
  tty: true
  restart: unless-stopped
  environment:
    - AUTOHEAL_CONTAINER_LABEL=autoheal
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock

And the services with healthcheck got the autoheal label:

geoserver:
  image: ...
  labels:
    autoheal: true  # <= there's an error here

Autoheal didn’t seem to be working for me. No logs. Well, the geoservers that might need autohealing rarely failed, which is good news, but it made it harder to see whether autoheal was working.

Last week I made some changes that improved the speed for several geoserver maps. But they also made geoserver as a whole unstable. So I had an Up (unhealthy) container. But autoheal didn’t restart it. And there was nothing in autoheal’s log output.

It turned out that autoheal: true was the problem. true needs to be quoted: autoheal: "true", as autoheal searches for the lowercase value. An unquoted true is a YAML boolean, and docker compose turns it into a capitalized True (probably the string representation of the boolean value), which autoheal doesn’t search for.
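The capitalization is easy to reproduce in plain Python. The sketch below assumes, as guessed above, that the compose tool parses the YAML into Python objects and later converts label values to strings. The YAML step itself is skipped, and the helper names are hypothetical.

```python
# Assumption: these are the values a YAML parser would hand over.
unquoted = True    # what `autoheal: true` parses to (a boolean)
quoted = "true"    # what `autoheal: "true"` parses to (a string)


def as_label(value):
    """Convert a parsed YAML value to the string that ends up as the label."""
    return str(value)


def matches_filter(label_value, wanted="true"):
    """Exact, case-sensitive comparison, like a label filter."""
    return label_value == wanted


print(as_label(unquoted), matches_filter(as_label(unquoted)))  # prints: True False
print(as_label(quoted), matches_filter(as_label(quoted)))      # prints: true True
```

So the unquoted variant ends up as the label value True, which a search for autoheal=true silently skips.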

After quoting the value, autoheal properly restarted misbehaving geoservers when they went belly-up:

geoserver:
  image: ...
  labels:
    autoheal: "true"  # <= quoted value works

That took some time to figure out… Especially as there was no output at all from the autoheal container. A short message upon startup (echo "autoheal is running") would personally have helped me to be sure the logging was actually working. I spent quite some time googling and figuring out whether there was actually something wrong with my logging. That’s why the tty: true is in there, for instance.

I hope this blog entry has the right words to help someone else plagued with the same problem :-) A quick note in the README, warning about the quotes, is probably a better solution. I’ve submitted an issue for it.

A win for open source, btw: I could read the source code for the autoheal shell script. That helped me figure out what was going wrong.

Reinout van Rees

My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.
