At my work we have quite a number of different sites/apps. Sometimes it is just a regular django website. Sometimes django + celery. Sometimes it also has extra django management commands, running from cronjobs. Sometimes Redis is used. Sometimes there are a couple of servers working together….
Anyway, life is interesting if you’re the one that people go to when something is (inexplicably) broken :-) What are the moving parts? What do you need to check? Running top to see if there’s a stuck process running at 100% CPU. Or if something eats up all the memory. df -h to check for a disk that’s full. Or looking at performance graphs in Zabbix. Checking our “sentry” instance for error messages. And so on.
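The disk-full check is the easiest one to automate. Here is a minimal sketch using only Python's standard library; the function names and the 90% threshold are my own assumptions, and the percentage can differ slightly from what df -h reports because df accounts for reserved blocks.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path="/", threshold=90.0):
    """True when usage is at or above `threshold` (an assumed cut-off)."""
    return disk_usage_percent(path) >= threshold


if __name__ == "__main__":
    percent = disk_usage_percent("/")
    print(f"/ is {percent:.1f}% full")
    if is_nearly_full("/"):
        print("Warning: disk is nearly full, time to clean up some files")
```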
You can solve the common problems that way. Restart a stuck server, clean up some files. But what about a website that depends on background jobs, run periodically from celery? If there are 10 similar processes stuck? Can you kill them all? Will they restart?
I had just such a problem a while ago. So I sat down with the developer. Three things came out of it.
I was told I could just kill the smaller processes. They can be re-run later. This means it is a good, loosely-coupled design: fine :-)
The README now has a section called “troubleshooting” with a couple of command line examples. For instance the specific celery command to purge a specific queue that’s often troublesome.
This is essential! I’m not going to remember that. There are too many different sites/apps to keep all those troubleshooting commands in my head.
A handy script (bin/repair) that prints out the commands that need to be executed to get everything right again. Re-running previously-killed jobs, for instance.
The script grew out of the joint debugging session. My colleague was telling me about the various types of jobs and celery/redis queues. And showing me redis commands that told me which jobs still needed executing. “Ok, so how do I then run those jobs? What should I type in?”
And I could check several directories to see which files were missing. Plus commands to re-create them. “So how am I going to remember this?”
In the end, I asked him if he could write a small program that did all the work we just did manually. Looking at the directories, looking at the redis queue, printing out the relevant commands?
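To give an idea of the shape of such a script, here is a minimal sketch. It is not the actual bin/repair: the directory names, the management command names (regenerate_output, rerun_job) and the bin/django wrapper are all hypothetical. Reading the pending job ids from the redis queue is left out, so they are simply passed in. The important design choice is that the script only prints commands and never executes anything itself.

```python
#!/usr/bin/env python
"""Print the commands needed to repair the site; nothing is executed."""
from pathlib import Path

# Hypothetical layout: every input file should have a same-named output file.
INPUT_DIR = Path("var/input")
OUTPUT_DIR = Path("var/output")


def missing_outputs(input_dir, output_dir):
    """Return sorted names of input files without a matching output file."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    if not input_dir.is_dir():
        return []
    existing = set()
    if output_dir.is_dir():
        existing = {path.name for path in output_dir.iterdir()}
    return sorted(
        path.name
        for path in input_dir.iterdir()
        if path.is_file() and path.name not in existing
    )


def repair_commands(missing, pending_job_ids):
    """Build the shell commands to print (command names are made up)."""
    commands = [f"bin/django regenerate_output {name}" for name in missing]
    commands += [f"bin/django rerun_job {job_id}" for job_id in pending_job_ids]
    return commands


def main(pending_job_ids=()):
    # In a real script the pending job ids would come from the redis queue.
    missing = missing_outputs(INPUT_DIR, OUTPUT_DIR)
    for command in repair_commands(missing, pending_job_ids):
        print(command)


if __name__ == "__main__":
    main()
```

Because the output is just a list of commands, whoever is on call can read them, skip the ones that look wrong and copy/paste the rest.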
Yes, that was possible. So a week ago, when the site broke down and the colleague was away on holiday, I could kill a few stuck processes, restart celery and run bin/repair. And copy/paste the suggested commands and execute them. Hurray!
So… make your sysadmin/devops/whatever happy and…
Provide a good README with troubleshooting info. Stuff like “you can always run bin/supervisorctl restart all without everything breaking”. Or warnings not to do that but to instead do xyz.
Provide a script that prints out what needs doing to get everything OK again.
My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.