Service monitoring victory

Tags: python, django, nelenschuurmans

As a company, we’re handling quite a lot of customer data. This involves a lot of different software components: databases (Postgres, Oracle), JDBC, XML-RPC, a Django website, a Windows server, etcetera. And sometimes one of those components falls over, bringing the data import/whatever for one or more customers to a halt.

Waiting for the phone to ring (“Hello, this is customer xyz….”) isn’t the best way of monitoring this. Doing a manual click-through every morning isn’t fun (and so it isn’t guaranteed to happen).

Solution: automatic monitoring. The things we want to monitor are apparently not covered by the standard munin/nagios types of checks (I don’t have enough knowledge about that part of our software to know for sure). So a colleague is working on a dashboard in Django. Basically, external checks write their data to the Django database and Django shows it. It keeps historical records (much like munin/nagios does).
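To make the "checks write, dashboard reads" idea a bit more concrete, here is a minimal sketch. It is not our colleague's dashboard code: I'm using the standard library's sqlite3 as a stand-in for the Django database, and the table name, column names and the functions `open_store`, `record_result` and `latest_status` are all made up for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open the (hypothetical) results store; sqlite3 stands in for the Django database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS check_result ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " customer TEXT NOT NULL,"
        " component TEXT NOT NULL,"
        " ok INTEGER NOT NULL,"
        " checked_at TEXT NOT NULL)"
    )
    return conn


def record_result(conn, customer, component, ok, checked_at=None):
    """Called by an external check: append one row, never overwrite, so history is kept."""
    checked_at = checked_at or datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO check_result (customer, component, ok, checked_at)"
        " VALUES (?, ?, ?, ?)",
        (customer, component, int(ok), checked_at),
    )
    conn.commit()


def latest_status(conn):
    """Called by the dashboard: the most recent result per customer/component pair."""
    rows = conn.execute(
        "SELECT customer, component, ok FROM check_result"
        " WHERE id IN (SELECT MAX(id) FROM check_result"
        "              GROUP BY customer, component)"
        " ORDER BY customer, component"
    ).fetchall()
    return [(customer, component, bool(ok)) for customer, component, ok in rows]
```

Because every check appends a row instead of updating one, the full history stays available for the munin/nagios-like graphs later on.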

The dashboard isn’t finished yet, however, and Mandatory Menial Manual Morning Checking was about to be scheduled for the IT department. Menial manual tasks are something I abhor, so I asked around a bit and found an existing snippet of code that checks the component that is the cause of 99% of the downtime. Three hours of coding later, we’ve now got a temporary web page that lists whether that component is up for our various customers. It should be easy to integrate into the real dashboard later on.
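For the curious, a page like that can be sketched in a few lines of Python. This is not the actual snippet we use: as an assumption, a plain TCP connect stands in for the real component check, and the customer names, hosts, ports and function names below are all hypothetical.

```python
import socket
from html import escape

# Hypothetical endpoints; the real hosts and ports are not part of this sketch.
CUSTOMERS = {
    "customer-a": ("jdbc-a.example.invalid", 5432),
    "customer-b": ("jdbc-b.example.invalid", 5432),
}


def port_is_open(host, port, timeout=3.0):
    """Stand-in check: True if a TCP connection succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_all(customers, checker=port_is_open):
    """Run the check for every customer; returns {customer name: True/False}."""
    return {
        name: checker(host, port)
        for name, (host, port) in customers.items()
    }


def render_status_page(statuses):
    """Render a bare-bones HTML table with one row per customer."""
    rows = "\n".join(
        "<tr><td>{}</td><td>{}</td></tr>".format(
            escape(name), "up" if up else "DOWN"
        )
        for name, up in sorted(statuses.items())
    )
    return "<html><body><table>\n{}\n</table></body></html>".format(rows)
```

Calling `render_status_page(check_all(CUSTOMERS))` gives the HTML for the page. The `checker` argument is there so that the real check can be swapped in later (or a fake one for testing) without touching the rendering.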

Note that getting someone to code such an automatic check was exactly the purpose of asking the IT department to do the manual monitoring. The trick worked :-)

This morning we had our first small victory: a colleague looked at the page and noticed three JDBC couplings were down. We restarted them and got them back on-line.

Like I said in my Hudson continuous integration article, we’ve got a laptop + big monitor in our IT room with our Hudson on continuous display. I’ve now opened a second window with the temporary JDBC tester page so that we can’t possibly overlook another downtime :-)

[Image: simple monitoring page]
 

Reinout van Rees

My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.
