Easy maintenance: script that prints out repair steps

Tags: python, django, nelenschuurmans

At my work we have quite a number of different sites/apps. Sometimes it is just a regular django website. Sometimes django + celery. Sometimes it also has extra django management commands, running from cronjobs. Sometimes Redis is used. Sometimes there are a couple of servers working together…

Anyway, life is interesting if you’re the one that people go to when something is (inexplicably) broken :-) What are the moving parts? What do you need to check? Running top to see if there’s a stuck process running at 100% CPU. Or if something eats up all the memory. df -h to check for a disk that’s full. Or looking at performance graphs in Zabbix. Checking our “sentry” instance for error messages. And so on.
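The "is the disk full?" check is easy to script, too. Below is a minimal sketch using only Python's standard library. The 90% threshold and the function names are my own assumptions, and the percentage can differ slightly from what df -h reports, because df leaves root-reserved blocks out of its calculation.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return how full the filesystem holding path is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_disk_nearly_full(path="/", threshold=90.0):
    """True when usage is at or above the (arbitrarily chosen) threshold."""
    return disk_usage_percent(path) >= threshold


if __name__ == "__main__":
    print("%.1f%% used on /" % disk_usage_percent("/"))
```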

You can solve the common problems that way. Restart a stuck server, clean up some files. But what about a website that depends on background jobs, run periodically from celery? If there are 10 similar processes stuck? Can you kill them all? Will they restart?

I had just such a problem a while ago. So I sat down with the developer. Three things came out of it.

  • I was told I could just kill the smaller processes. They can be re-run later. This means it is a good, loosely-coupled design: fine :-)

  • The README now has a section called “troubleshooting” with a couple of command line examples. For instance the specific celery command to purge a specific queue that’s often troublesome.

    This is essential! I’m not going to remember that. There are too many different sites/apps to keep all those troubleshooting commands in my head.

  • A handy script (bin/repair) that prints out the commands that need to be executed to get everything right again. Re-running previously-killed jobs, for instance.

The script grew out of the joint debugging session. My colleague was telling me about the various types of jobs and celery/redis queues. And showing me redis commands that told me which jobs still needed executing. “Ok, so how do I then run those jobs? What should I type in?”

And I could check several directories to see which files were missing. Plus commands to re-create them. “So how am I going to remember this?”

In the end, I asked him if he could write a small program that did all the work we just did manually. Looking at the directories, looking at the redis queue, printing out the relevant commands?
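To give an idea of what such a script can look like, here is a minimal sketch. It is not our actual bin/repair. The command names (bin/django rerun_job, bin/django recreate_file), the directory and the file names are all made up for illustration, and the pending jobs are hard-coded where a real script would read them from the redis queue. The important design choice: it only prints commands, it never executes anything itself.

```python
#!/usr/bin/env python
"""Sketch of a bin/repair-style script: it only prints commands, it runs nothing."""
import os
import shlex


def missing_files(directory, expected_names):
    """Return the expected file names that are absent from directory."""
    return [name for name in expected_names
            if not os.path.exists(os.path.join(directory, name))]


def repair_commands(pending_jobs, directory, expected_names):
    """Build the list of shell commands an operator should copy/paste."""
    commands = []
    for job_id in pending_jobs:
        # 'rerun_job' is a hypothetical management command name.
        commands.append("bin/django rerun_job " + shlex.quote(str(job_id)))
    for name in missing_files(directory, expected_names):
        # 'recreate_file' is equally hypothetical.
        commands.append("bin/django recreate_file " + shlex.quote(name))
    return commands


def main():
    # Assumption: a real script would fetch the pending jobs from the
    # queue (for instance a redis list); hard-coded here so the sketch
    # stays self-contained.
    pending_jobs = ["job-17", "job-23"]
    commands = repair_commands(
        pending_jobs, "var/output", ["daily.csv", "weekly.csv"])
    if not commands:
        print("Nothing to repair.")
    for command in commands:
        print(command)


if __name__ == "__main__":
    main()
```

Because the script only prints, the person doing the repair can read every command before pasting it into a shell, and skip the ones that look wrong for the situation at hand.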

Yes, that was possible. So a week ago, when the site broke down and the colleague was away on holiday, I could kill a few stuck processes, restart celery and run bin/repair. And copy/paste the suggested commands and execute them. Hurray!

So… make your sysadmin/devops/whatever happy and…

  • Provide a good README with troubleshooting info. Stuff like “you can always run bin/supervisorctl restart all without everything breaking”. Or warnings not to do that but to do xyz instead.

  • Provide a script that prints out what needs doing to get everything OK again.

 

Reinout van Rees

My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.
