Healthy webapps through continuous introspection - Erik van Zijst

Tags: django, djangocon

There are lots of small things that you won’t really notice in development, but that’ll bite your visitors in production.

Sometimes code does far more work than it needs to. They (Bitbucket) had a script that only had to look up a user’s repository location. Bitbucket is Django, so they used the Django ORM. But loading the entire Django ORM is a lot for a script that is supposed to start quickly: it took 1.4 seconds to start up. The fix was simple: just fire off a small piece of SQL. They deployed the fix a few weeks ago and the load on their servers decreased by 60%. Wow. And they’d been buying lots and lots of servers… The script now starts in 0.09 seconds instead of 1.4.
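To make the "small piece of SQL" idea concrete, here is a minimal sketch. It is not Bitbucket's actual script: the table name, the column names and the repository_path helper are made up for illustration, and sqlite3 stands in for whatever database driver they really use. The point is only that a direct query needs no ORM startup.

```python
import sqlite3


def repository_path(conn, username, repo_slug):
    """Return the on-disk path of a repository, or None if it is unknown."""
    # Hypothetical schema: one row per repository.
    row = conn.execute(
        "SELECT path FROM repositories WHERE owner = ? AND slug = ?",
        (username, repo_slug),
    ).fetchone()
    return row[0] if row else None


# Demo with an in-memory database standing in for the production one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repositories (owner TEXT, slug TEXT, path TEXT)")
conn.execute(
    "INSERT INTO repositories VALUES ('erik', 'dogslow', '/data/erik/dogslow')"
)
print(repository_path(conn, "erik", "dogslow"))
```

The query uses placeholders instead of string formatting, so dropping the ORM does not mean dropping protection against SQL injection.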

Some common reasons for slowness:

Slow SQL queries. Or too many.

One time they had a specific API endpoint that was slow for the end users. Locally it worked just fine. It took lots of work to replicate it locally. In certain corner cases it would fire off 100k database queries!
Lock contention between threads (Python’s global interpreter lock), file locks by git/hg, database table/row locks.
Excessive IO to disk or network.

Memcached is great to speed up your application. Unless you use it in the wrong way. One of their Python loops executed the exact same query lots of times instead of holding on to the result. Memcached is quick, but requesting the same thing 500 times does use up IO.
Evil regular expressions. Like r'^(a+)+$'. It is easy to make them run for a long time.
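A quick way to see the regular expression problem for yourself is the sketch below. The pattern is the one from the talk. The input (a run of 'a' characters followed by a single 'b') and the match_seconds helper are my own choice for the demo. With Python's backtracking re module, each extra 'a' should roughly double the time before the match fails, so keep the lengths small.

```python
import re
import time

EVIL = re.compile(r'^(a+)+$')


def match_seconds(n):
    """Time a failing match against n 'a' characters followed by one 'b'."""
    text = 'a' * n + 'b'  # the trailing 'b' forces the match to fail
    start = time.perf_counter()
    result = EVIL.match(text)
    return result, time.perf_counter() - start


# Small lengths only: the running time grows very quickly with n.
for n in (14, 16, 18, 20):
    result, seconds = match_seconds(n)
    print(n, result, round(seconds, 4))
```

If user input can reach a pattern like this, a single request with a long enough string can keep a worker busy until it gets killed.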

Possible consequences: error 503 when the worker pools are full. Or error 500 if requests time out (gunicorn sends SIGKILL, its built-in protection against denial of service). The problem with SIGKILL is that it is the most drastic kill there is: kill -9. The process is really dead: it cannot release locks, clean up or release database connections. A SIGKILL also destroys any evidence: you won’t get a traceback out of a process killed in that way!

A handy tool: dogslow. Django middleware that emails tracebacks of slow requests. There’s no performance penalty and it is safe on production. You must configure it so that it fires before gunicorn SIGKILLs :-)
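The general idea of such a watchdog can be sketched with the standard library alone. This is not dogslow's actual code, just an illustration of the technique: a timer thread looks at the stack of the thread handling the request once a threshold has passed, and reports a traceback without interrupting the request. The names run_with_watchdog, slow_view and reports are made up for the example, and a real setup would send an email instead of appending to a list.

```python
import sys
import threading
import time
import traceback


def run_with_watchdog(func, threshold, report):
    """Call func(); if it is still running after `threshold` seconds,
    pass a traceback of the calling thread to `report` and keep going."""
    thread_id = threading.get_ident()

    def peek():
        # Look up the current stack frame of the thread that runs func().
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            report(''.join(traceback.format_stack(frame)))

    timer = threading.Timer(threshold, peek)
    timer.daemon = True
    timer.start()
    try:
        return func()
    finally:
        timer.cancel()  # fast calls never trigger a report


def slow_view():
    time.sleep(0.3)  # stands in for a slow request
    return 'done'


reports = []
result = run_with_watchdog(slow_view, 0.1, reports.append)
print(result, len(reports))
```

Because the timer is cancelled as soon as the call returns, requests that finish in time cost next to nothing, which fits the "no performance penalty" claim above.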

With dogslow, they got an email about a request that took 25 seconds, but which normally takes just a few hundred milliseconds. The cause turned out to be a heavily-fragmented git repository. Calling git-gc (garbage collection) reduced the time needed from 35 seconds to 0.1. With dogslow they noticed it immediately and they could fix it.

Geordi is also middleware that is safe to use in production. You can selectively profile individual requests by adding ?__geordi__ to any URL. (Only admin users can do that, btw). It produces a PDF call graph. Behind the scenes, it runs the request outside the regular worker pool via celery. You won’t have any timeouts there and the request can take as much time as it needs.
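The "profile only when asked" idea can be sketched with cProfile from the standard library. This is not how geordi is implemented (no celery, no PDF call graph, no admin check): handle, profile_call and demo_view are hypothetical names, and the report is plain text. It only shows the switch on the query string.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats('cumulative').print_stats(10)  # ten most expensive entries
    return result, out.getvalue()


def handle(path, query, view):
    """Hypothetical dispatcher: profile only when the marker is in the query."""
    if '__geordi__' in query:
        return profile_call(view, path)
    return view(path), None  # normal requests pay no profiling cost


def demo_view(path):
    return len(path)


result, report = handle('/repo', '__geordi__', demo_view)
print(result)
print(report)
```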

Both dogslow and geordi are reactive. They tell you something only after it has already gone wrong, or you have to run them manually. So they made something that’s more proactive: interruptingcow.

You can use interruptingcow to timebox chunks of code. It raises a RuntimeError in the worker, which means it fails in a controlled manner: you can ensure proper cleanup (locks, db connections). Very simple and very useful utility:

from interruptingcow import timeout

with timeout(20.0, RuntimeError):
    run_something_that_might_take_long()
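To show why failing with an exception is so much nicer than a SIGKILL, here is a stand-in for the same technique built on the standard signal module. It is a sketch, not interruptingcow's code: whether the library works exactly this way is not stated here, and the timebox name is made up. It is Unix-only and only works in the main thread. The cleanup in the finally block still runs when the time is up.

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def timebox(seconds, exception=RuntimeError):
    """Raise `exception` if the block runs longer than `seconds`.

    Sketch only: relies on SIGALRM, so Unix and main thread only.
    """
    def on_alarm(signum, frame):
        raise exception('timed out after %s seconds' % seconds)

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel a pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


events = []
try:
    with timebox(0.2):
        try:
            events.append('lock acquired')
            time.sleep(2)  # stands in for the slow work
        finally:
            events.append('lock released')  # cleanup still runs
except RuntimeError:
    events.append('timed out')

print(events)
```

The timeout you pick should be shorter than gunicorn's, so the exception arrives before the SIGKILL does.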

These three utilities (dogslow, geordi and interruptingcow) can help you a lot!

Dogslow for tracebacks of slow requests.
Geordi to profile production environments.
Interruptingcow to prevent slowness and to fail gracefully.

Otherwise you’re stuck with customer emails like “it is slow”….