Django under the hood: django at instagram - Carl Meyer

Tags: django, djangocon

(One of my summaries of a talk at the 2016 django under the hood conference).

Instagram is huge. He mentioned a number of fun facts with lots of zeros in them. Oh, and cat photos. They have tens of thousands of django instances.

Instagram started in 2010. According to one of the founders, django was super easy to set up. There is one obvious way of doing things. There’s a test framework built in. A few months later, in October 2010, they had 1 million users.

In June 2011, instagram had 5 million users. All in the database. With database routers they could partition it over multiple servers. The django ORM was managing those 5 million users in the database just fine. But slowly the number of likes became too much. It needed custom sharding. They used postgres schemas, each of which acts more like a “logical shard”. They could then map those logical shards at will onto actual physical servers.
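To make the "logical shard" idea a bit more concrete, here is a minimal sketch. The shard count, the schema naming and the server names are all made up for illustration; the talk did not give instagram's actual numbers or code.

```python
# Hypothetical values, chosen only for illustration.
LOGICAL_SHARDS = 8

# Every logical shard (a postgres schema) lives on some physical server.
SHARD_TO_SERVER = {
    shard: "db-a" if shard < 4 else "db-b" for shard in range(LOGICAL_SHARDS)
}


def logical_shard(user_id: int) -> int:
    """Pick the logical shard that holds all data for this user."""
    return user_id % LOGICAL_SHARDS


def schema_name(shard: int) -> str:
    """Name of the postgres schema that backs a logical shard."""
    return f"shard_{shard:04d}"


def locate(user_id: int) -> tuple[str, str]:
    """Return (physical server, schema) for a user's data."""
    shard = logical_shard(user_id)
    return SHARD_TO_SERVER[shard], schema_name(shard)
```

The point of the indirection: moving a logical shard to another machine only means changing one entry in `SHARD_TO_SERVER`. The user-to-shard mapping and the schema name stay the same.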

In the end, they started a custom ORM to better handle the huge amount of sharding that they needed. The likes were moved over first; the user data followed two years later. The Django ORM is still used in places, but the huge data is in a custom ORM.

There was one problem: Justin Bieber. The sharding was done based on user. So one user’s data was stored on just one shard. So when Justin Bieber posted something and the entire world liked it… that shard got very very busy :-)
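A small sketch of why per-user sharding creates such a hot spot. It assumes, as my reading of the talk, that a like is stored on the shard of the user who posted the photo. The shard count and user ids are invented.

```python
from collections import Counter

LOGICAL_SHARDS = 8  # hypothetical


def shard_for(owner_id: int) -> int:
    """All data belonging to one user ends up on one shard."""
    return owner_id % LOGICAL_SHARDS


def writes_per_shard(likes):
    """Count like-writes per shard; each like is (liker_id, owner_id)."""
    return Counter(shard_for(owner_id) for _liker_id, owner_id in likes)


# 1000 likes on photos of 1000 different ordinary users...
ordinary = [(liker, liker + 1) for liker in range(1000)]
# ...versus 1000 likes that all target one celebrity (hypothetical id 42).
celebrity = [(liker, 42) for liker in range(1000)]

ordinary_load = writes_per_shard(ordinary)
celebrity_load = writes_per_shard(celebrity)
```

In the first case every shard gets an equal share of the writes. In the second case a single shard has to absorb all of them.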

In April 2012 they had 40M instagrammers. They were acquired by facebook and had to move from amazon to facebook data centers. A disaster recovery exercise meant they had to move everything a second time. They didn’t want to go through it a third time, so they switched to a multi-datacenter architecture. That had its own problems, especially regarding caching and memcache. They later moved to facebook’s TAO system.

They now use TAO for everything. The last postgres cluster has just been switched off.

In June 2016 they hit 500 million users. So it was finally time to try and upgrade their heavily patched django 1.3. (Laughter in the room).

They did it the brute force way. Just install 1.8 and keep on fixing bugs until all the tests pass. It took a couple of months. Oh, and it had to keep on being compatible with 1.3 :-) Slowly all servers were moved over. And a short time ago they finally removed the 1.3 code. So instagram now runs fully on django 1.8.

What happens when a request hits instagram:

  • Proxygen. A facebook open source http load balancer.

  • Django + uWSGI

  • TAO, cassandra, everstore, celery/rabbitmq.

Instagram does continuous deployments. They have 30-50 deploys per day! Within 10 minutes the changes are on every django server. Every deploy has about 3 commits in it.

To support this, they need good visible statistics to spot performance regressions. If one happens, you can look at the commits that went live around that time. It can also be caused by a new feature that’s slowly being enabled (“feature gate”).
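The talk did not go into how instagram's feature gates are implemented. One common approach, sketched here with invented names, is a percentage rollout: hash the feature name plus the user id into one of 100 buckets and enable the feature for the first N buckets.

```python
import hashlib


def in_rollout(feature: str, user_id: int, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` buckets.

    Hypothetical helper: the hash makes the decision stable per user, so
    raising `percent` only ever adds users and never removes them.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

With a gate like this, a regression can creep in gradually as the percentage is raised, without any new commit going live. That is why the gate changes need to be visible next to the deploys.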

If they really need in-depth performance metrics, they temporarily enable python’s cProfile on a very limited number of requests. But… they’ve made a change. cProfile normally measures time. But you can pass it any function that returns a number as its timer. So they have it return the number of CPU instructions instead.
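A minimal sketch of that custom-timer trick with python's standard `cProfile` module. Reading real CPU instruction counts needs OS-specific tooling that is not in the standard library, so `cost_counter` below is only a stand-in (it returns CPU time in nanoseconds). The `work` function is invented as well.

```python
import cProfile
import io
import pstats
import time


def cost_counter() -> int:
    # Stand-in for a CPU instruction counter (an assumption for this
    # sketch). Any function that returns a number works as the timer.
    return time.process_time_ns()


def work() -> int:
    """Some code to profile."""
    return sum(i * i for i in range(100_000))


# timeunit tells cProfile how to scale the integer counter in its report.
profiler = cProfile.Profile(timer=cost_counter, timeunit=1e-9)
profiler.enable()
result = work()
profiler.disable()

# Collect the report in a string instead of printing it to stdout.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

The report has the usual cProfile layout, only the "time" columns now show whatever the counter measures.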

Fixing efficiency regressions:

  • Some regressions are obvious when you see them.

  • Don’t do useless work. Don’t grab data if you don’t need it.

  • Cache things that don’t change.

  • Sometimes you need to go deeper: with cython you can change a .py to a .pyx and compile it to C code.

  • Extreme hot spots can be rewritten as C code.
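As an illustration of the "cache things that don't change" item: a small in-process sketch with `functools.lru_cache`. The lookup function and its data are invented, and a site of instagram's size would rather use a shared cache, so treat this only as the basic idea.

```python
from functools import lru_cache

lookups = 0  # counts how often the slow path actually runs


@lru_cache(maxsize=1024)
def country_name(code: str) -> str:
    """Hypothetical lookup of data that never changes at runtime."""
    global lookups
    lookups += 1
    return {"nl": "The Netherlands", "de": "Germany"}[code]


# A thousand calls, but the function body only runs once.
names = [country_name("nl") for _ in range(1000)]
```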

Django is tightly integrated, allowing instagram to get started quickly. Django is also loosely coupled enough so that they could replace, for instance, the ORM.

Django is at the core of instagram. They’ve just become a gold level sponsor of the django software foundation (=> applause!). And all facebook and instagram employees are allowed to contribute code to django.

Things they want to start doing:

  • Python 3. They’ll probably also do it in the same brute force way they did the 1.3 to 1.8 django migration :-)

  • Async(io) could help performance-wise.

  • Traffic replay.

  • Perhaps they’ll start using pypy? That’ll be a hard one.

  • A long way out… they (or someone else) are experimenting with integrating a JIT compiler into CPython.

beautiful volcanic lake

Photo explanation: Instagram? Good excuse just to show a beautiful picture of a German volcanic lake (“Booser Maar”). This is the exploding lake that dumped ground layers on a nearby volcano, see the photo I’ve shown in my validation talk summary.


About me

My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.
