I use docker-compose quite a lot. Most of the python/django stuff we deploy is done with docker-compose (one of the two big ones is in kubernetes already). A while back I moved several “geoservers” to docker-compose. Geoserver is a web mapping server written with java/tomcat. Normally pretty stable, but you can get it to crash or to become unresponsive.
So that’s something for which docker’s health check comes in handy. You can configure it in docker-compose itself, but I put it in our geoserver’s custom Dockerfile as I was making some other modifications anyway:
    FROM docker.osgeo.org/geoserver:2.23.1

    ... some unrelated customizations ...

    HEALTHCHECK --interval=20s --timeout=10s --retries=3 --start-period=150s \
        CMD curl --fail --max-time 3 http://localhost:8080/geoserver/web/ || exit 1
A simple “curl” command to see if the geoserver still displays its start page. With a generous --start-period as geoserver needs quite some time to start up.
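For comparison, roughly the same probe can be sketched in Python. This is only an illustration, not what the Dockerfile runs: the function name geoserver_responds is made up, the URL is the one from the HEALTHCHECK line, and the timeout here applies per socket operation rather than to the whole request like curl's --max-time.

```python
import urllib.request


def geoserver_responds(url="http://localhost:8080/geoserver/web/", timeout=3):
    """Return True if the URL answers with a non-error status in time.

    Roughly what `curl --fail --max-time 3 URL` checks: a connection error,
    a timeout or an HTTP 4xx/5xx answer all count as failure.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

The curl one-liner is still the better fit inside the image, as it needs no Python interpreter in the container.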
Docker-compose allows for healthchecks, and displays Up (healthy) in the “state” column when you call docker-compose ps. But docker-compose doesn’t actually restart unhealthy containers. For that, you need docker-autoheal as an extra service. At the core, it consists of a single shell script that asks docker if there are containers matching the filter health=unhealthy and optionally autoheal=true. If found, they get restarted.
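The idea behind that script can be sketched in a few lines of Python around the docker command line. This is a sketch under assumptions, not autoheal's actual code: the function names are hypothetical and the real shell script may well ask docker in a different way. The docker ps filters health=unhealthy and label=NAME=VALUE are standard docker CLI options.

```python
import subprocess


def unhealthy_ps_command(label=None):
    """Build the `docker ps` call that lists unhealthy container IDs.

    `label` is the optional label name; when given, only containers
    carrying `<label>=true` are listed.
    """
    command = ["docker", "ps", "--quiet", "--filter", "health=unhealthy"]
    if label:
        command += ["--filter", f"label={label}=true"]
    return command


def restart_unhealthy(label=None):
    """Restart every container that docker reports as unhealthy."""
    listing = subprocess.run(
        unhealthy_ps_command(label), capture_output=True, text=True, check=True
    )
    for container_id in listing.stdout.split():
        subprocess.run(["docker", "restart", container_id], check=True)


# Only show the command here; restart_unhealthy() needs a running docker daemon.
print(" ".join(unhealthy_ps_command("autoheal")))
```

Calling restart_unhealthy("autoheal") every few seconds in a loop would give the basic behaviour. Note that the label filter compares the value as an exact string.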
I have a mix of services (geoserver, pgbouncer, nginx) with only the geoservers having a health check. So I configured autoheal like this in my docker-compose.yml:
    autoheal:
      image: willfarrell/autoheal:1.1.0
      tty: true
      restart: unless-stopped
      environment:
        - AUTOHEAL_CONTAINER_LABEL=autoheal
      volumes:
        - /var/run/docker.sock:/var/run/docker.sock
And the services with healthcheck got the autoheal label:
    geoserver:
      image: ...
      labels:
        autoheal: true  # <= there's an error here
Autoheal didn’t seem to be working for me. No logs. Well, the geoservers that could need to be autohealed rarely failed, which is good news, but made it harder to see if autoheal was working.
Last week I made some changes that improved the speed for several geoserver maps. But it also made geoserver as a whole unstable. So I had an Up (unhealthy) container. But autoheal didn’t restart it. And there was nothing in autoheal’s log output.
It turned out that autoheal: true was the problem. true needs to be quoted: autoheal: "true", as autoheal searches for the lowercase value. An unquoted true is parsed as a YAML boolean, and docker compose turns it into the label value True (capitalized, probably the string representation of the boolean), which autoheal doesn’t search for.
After quoting the value, autoheal properly restarted misbehaving geoservers when they went belly-up:
    geoserver:
      image: ...
      labels:
        autoheal: "true"  # <= quoted value works
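The difference between the two spellings can be mimicked in plain Python. This is a sketch under an assumption: that the label value ends up as a string through something like Python's str(), which would explain the capitalized True. Both function names are made up for the illustration.

```python
def label_as_string(value):
    """Hypothetical stand-in for how a parsed label value becomes a string."""
    return str(value)


def autoheal_would_match(label_value, wanted="true"):
    """Exact, case-sensitive comparison, like a label filter on a value."""
    return label_value == wanted


unquoted = label_as_string(True)    # YAML `autoheal: true` parses to a boolean
quoted = label_as_string("true")    # YAML `autoheal: "true"` stays a string

print(unquoted, autoheal_would_match(unquoted))  # prints: True False
print(quoted, autoheal_would_match(quoted))      # prints: true True
```

So the unquoted label is there, just with a value that never matches the lowercase true the filter asks for.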
That took some time to figure out… Especially as there was no output at all from the autoheal container. A short message upon startup (echo "autoheal is running") would have helped me to be sure the logging was actually working. I spent quite some time googling and figuring out whether there was actually something wrong with my logging. That’s why the tty: true is in there, for instance.
I hope this blog entry has the right words to help someone else plagued with the same problem :-) A quick note in the README, warning about the quotes, is probably a better solution. I’ve submitted an issue for it.
A win for open source, btw: I could read the source code for the autoheal shell script. That helped me figure out what was going wrong.
My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.