We (Nelen & Schuurmans) have quite a few servers. Most of them are VMware virtual machines in a vSphere cluster.
Once in a while, our monitoring tool (Zabbix) reported one or more of the machines as having a time drift problem. Weird, as we have NTP running everywhere. And weird to look at Django logfiles and suddenly see a negative jump in time.
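Such a negative jump is easy to spot automatically once you have the timestamps out of a logfile. A minimal sketch, assuming the log lines have already been parsed into datetime objects (the parsing depends on your log format and is left out; find_backward_jumps is a made-up helper name, not something from Django or Zabbix):

```python
from datetime import datetime, timedelta


def find_backward_jumps(timestamps, tolerance=timedelta(0)):
    """Return (index, delta) pairs where a timestamp lies before its predecessor.

    `timestamps` is a list of datetime objects in the order they appear
    in the logfile. `tolerance` lets you ignore tiny reorderings.
    """
    jumps = []
    for i in range(1, len(timestamps)):
        delta = timestamps[i] - timestamps[i - 1]
        if delta < -tolerance:
            jumps.append((i, delta))
    return jumps


if __name__ == "__main__":
    # Illustrative timestamps only: the third entry is almost an hour
    # earlier than the second one.
    log_times = [
        datetime(2013, 1, 1, 12, 0, 0),
        datetime(2013, 1, 1, 12, 0, 5),
        datetime(2013, 1, 1, 11, 0, 6),
        datetime(2013, 1, 1, 11, 0, 7),
    ]
    for index, delta in find_backward_jumps(log_times):
        print("line %d: clock went back %s" % (index, -delta))
```

The tolerance parameter is there because multi-process setups sometimes write log lines slightly out of order, which is not the same as a clock that really jumped back.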
We run ntpd everywhere to keep the time in sync with two Windows domain servers. Every time a server drifted, the ntpd daemon turned out to have died. Without leaving any trace in any logfile.
ntpd kills itself when the time offset gets too big (by default more than 1000 seconds, so some 17 minutes), assuming that it hurts more than it helps. There’s a switch to prevent this self-killing behaviour, but ntpd killed itself anyway.
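To make that self-killing rule concrete, here is a small model of it. This is only a sketch of the decision, not ntpd's actual code. It assumes ntpd's documented default panic threshold of 1000 seconds and the documented convention that a threshold of 0 ("tinker panic 0" in ntp.conf) switches the check off. The function name is made up.

```python
DEFAULT_PANIC_THRESHOLD = 1000.0  # seconds; ntpd's documented default


def ntpd_would_give_up(offset_seconds, panic_threshold=DEFAULT_PANIC_THRESHOLD):
    """Model of ntpd's panic check: True if the offset is too big to fix.

    A threshold of 0 means the sanity check is disabled, so ntpd
    would accept any offset.
    """
    if panic_threshold == 0:
        return False
    return abs(offset_seconds) > panic_threshold


if __name__ == "__main__":
    print(ntpd_would_give_up(5))       # False: small offset, ntpd fixes it
    print(ntpd_would_give_up(-3600))   # True: an hour off, ntpd exits
    print(ntpd_would_give_up(3600, panic_threshold=0))  # False: check disabled
```

So a clock that suddenly is an hour off, in either direction, is far beyond what a default ntpd is willing to correct.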
In the end, an external sysadmin found the problem:
One of the physical vSphere host machines (big server, lots of blades) was misconfigured: the NTP daemon on the host machine itself was set up correctly, but it was not configured to start automatically when the server boots…
Naturally, this host machine’s time started to drift.
Several actions vSphere does on a VM result in a very, very short period where the VM is frozen. Actions like “full backup”, “snapshot” and “automatically moving from one host machine to another for performance reasons”. The freeze is very short, but VMware does adjust the time inside the VM: it keeps track of how long the quick action took and adjusts the VM’s time accordingly.
It adjusts the time relative to the host machine’s time. So if an action took 1 second, the second is added to the host machine’s time and the result is set as the VM time. All is still well if the VM stays on the same host.
But if the action includes moving the VM to a different host, and that host is the one with the drifted time, things go wrong. If the host machine’s time has drifted by an hour, the VM that gets moved to that host suddenly has its internal time moved by an hour…
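The arithmetic behind this fits in a few lines. A minimal sketch, assuming the simplified model described above (VM time after the action = the host's own clock plus the duration of the action). The function names and the example date are made up for illustration; this is not VMware's actual algorithm.

```python
from datetime import datetime, timedelta


def host_clock(true_time, host_drift):
    """What a host thinks the time is, given how far its clock has drifted."""
    return true_time + host_drift


def vm_time_after_action(true_time, action_duration, host_drift):
    """VM time once the action finishes, set relative to the host's clock."""
    return host_clock(true_time, host_drift) + action_duration


if __name__ == "__main__":
    start = datetime(2013, 1, 1, 12, 0, 0)   # illustrative moment
    one_second = timedelta(seconds=1)

    # Snapshot on a host with a correct clock: all is well.
    print(vm_time_after_action(start, one_second, timedelta(0)))
    # 2013-01-01 12:00:01

    # Move to a host whose clock is an hour ahead: the VM jumps an hour.
    print(vm_time_after_action(start, one_second, timedelta(hours=1)))
    # 2013-01-01 13:00:01
```

Move the VM back to a host with a correct clock and the same calculation makes its time jump an hour backwards, which would match the negative jumps in the logfiles.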
So it was a combination of a host machine with a drifted time and the fact that VMware adjusts the VM’s time after certain actions.
Writing it down as it might help someone googling for this problem :-)
My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.