We (Nelen & Schuurmans) have quite a few servers. Most of them are VMware virtual machines in a vSphere cluster.
Once in a while, our monitoring tool (Zabbix) reported one or more of the machines as having a time drift problem. Weird, as we have NTP running everywhere. And weirder still when you look at the Django logfiles and see time suddenly jump backwards.
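To spot such backward jumps without scrolling through the logs by hand, something like the following rough sketch could help. It assumes every regular log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, which depends entirely on your logging configuration. The function name `find_backward_jumps` and the sample lines are made up for illustration; they are not taken from our real logs.

```python
from datetime import datetime


def find_backward_jumps(lines, fmt="%Y-%m-%d %H:%M:%S", width=19):
    """Yield (line_number, previous, current) where a timestamp goes backwards.

    Assumption: each regular log line starts with a timestamp in `fmt`
    that is exactly `width` characters long.
    """
    previous = None
    for number, line in enumerate(lines, start=1):
        try:
            current = datetime.strptime(line[:width], fmt)
        except ValueError:
            # No leading timestamp (for instance a traceback line): skip it.
            continue
        if previous is not None and current < previous:
            yield number, previous, current
        previous = current


# Hypothetical sample data, only to show the idea.
sample = [
    "2012-05-10 14:03:10 INFO request started",
    "2012-05-10 14:03:12 INFO request finished",
    "Traceback (most recent call last):",
    "2012-05-10 13:47:55 INFO request started",
    "2012-05-10 13:47:58 INFO request finished",
]
jumps = list(find_backward_jumps(sample))
for number, previous, current in jumps:
    print(f"line {number}: clock went back by {previous - current}")
```

For a real logfile you would pass the open file object instead of `sample`. Lines written by several processes can be slightly out of order anyway, so in practice you would probably only report jumps larger than a few seconds.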
We run ntpd everywhere to keep the time in sync with two Windows domain servers. Every time a server drifted, ntpd turned out to have died, without leaving any trace in any logfile.
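ntpd kills itself when the time offset gets too big (the default panic threshold is 1000 seconds, so about 17 minutes), assuming that correcting it would hurt more than it helps. There's a switch (`-g`) to prevent this self-killing behaviour, but ntpd killed itself anyway. As far as I can tell from the documentation, that switch only allows one such big correction, so it doesn't protect you when the clock jumps again later on.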
In the end, an external sysadmin found the problem:
So it was a combination of host machines with a drifted clock and the fact that VMware adjusts the VM's time after certain actions.
I'm writing it down as it might help someone googling for this problem :-)
My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.