Reinout van Rees’ weblog

Django under the hood: django at instagram - Carl Meyer

2016-11-04

Tags: django, djangocon

(One of my summaries of a talk at the 2016 django under the hood conference).

Instagram is huge. He mentioned a number of fun facts with lots of zeros in them. Oh, and cat photos. They have tens of thousands of django instances.

Instagram started in 2010. According to one of the founders, django was super easy to set up. There is one obvious way of doing things. There’s a test framework built in. A few months later, in October 2010, they had 1 million users.

In June 2011, instagram had 5 million users. All in the database. With database routers they could partition it over multiple servers. The django ORM was managing those 5 million users in the database just fine. But slowly the number of likes was becoming too much. It needed custom sharding. They used postgres schemas, which are more like a “logical shard”. They could then map those logical shards at will onto actual physical servers.

In the end, they started a custom ORM to better handle the huge amount of sharding that they needed. The likes were moved over first, two years later the user data moved. The Django ORM is still used in places, but the huge data is in a custom ORM.

There was one problem: Justin Bieber. The sharding was done based on user. So one user’s data was stored on just one shard. So when Justin Bieber posted something and the entire world liked it.... that shard got very very busy :-)

April 2012 they had 40M instagrammers. They were acquired by facebook and had to move from amazon to facebook data centers. A disaster recovery exercise meant they had to move everything a second time. They didn’t want to go through it a third time, so they switched to a multi-datacenter architecture. That had its own problems, especially regarding caching and memcache. They later moved to facebook’s TAO system.

They now use TAO for everything. The last postgres cluster has just been switched off.

June 2016 they hit 500 million users. So it was finally time to try and upgrade their heavily patched django 1.3. (Laughter in the room).

They did it the brute force way. Just install 1.8 and keep on fixing bugs until all the tests pass. It took a couple of months. Oh, and it had to keep on being compatible with 1.3 :-) Slowly all servers were moved over. And a short time ago they finally removed the 1.3 code. So instagram now runs fully on django 1.8.

What happens when a request hits instagram:

  • Proxygen. A facebook open source http load balancer.
  • Django + uWSGI
  • TAO, cassandra, everstore, celery/rabbitmq.

Instagram does continuous deployments. They have 30-50 deploys per day! Within 10 minutes the changes are on every django server. Every deploy has about 3 commits in it.

To support this, they need good visible statistics to spot performance regressions. If one happens, you can look at the commits that went live around that time. It can also be caused by a new feature that’s slowly being enabled (“feature gate”).

If they really need in-depth performance metrics, they temporarily enable python’s cProfile on a very limited number of requests. But... they’ve made a change. cProfile normally measures CPU time. But you can pass it any function that returns a number. So they pass it the number of CPU instructions instead.
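
As an illustration of the mechanism (explicitly not instagram’s actual code): cProfile’s Profile class accepts a custom timer function, as long as it returns a number. Here time.process_time is a stand-in for the CPU instruction counter they actually use, and handle_request() is a hypothetical function under measurement.

import cProfile
import time

def handle_request():
    # Hypothetical stand-in for the code under measurement.
    sum(i * i for i in range(100000))

# The real deal would pass a function that reads a hardware instruction
# counter; time.process_time just demonstrates the custom-timer hook.
profiler = cProfile.Profile(timer=time.process_time)
profiler.enable()
handle_request()
profiler.disable()
profiler.print_stats(sort='cumulative')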

Fixing efficiency regressions:

  • Some regressions are obvious when you see them.
  • Don’t do useless work. Don’t grab data if you don’t need it.
  • Cache things that don’t change.
  • Sometimes you need to go deeper: with cython you can change a .py to a .pyx and compile it to C code.
  • Extreme hot spots can be rewritten as C code.

Django is tightly integrated, allowing instagram to get started quickly. Django is also loosely coupled enough so that they could replace, for instance, the ORM.

Django is at the core of instagram. They’ve just become a gold level sponsor of the django software foundation (=> applause!). And all facebook and instagram employees are allowed to contribute code to django.

Things they want to start doing:

  • Python 3. They’ll probably also do it in the same brute force way they did the 1.3 to 1.8 django migration :-)
  • Async(io) could help performance-wise.
  • Traffic replay.
  • Perhaps they’ll start using pypy? That’ll be a hard one.
  • A long way out.... they’re experimenting (or someone else is) with integrating the JIT compiler into CPython.
beautiful volcanic lake

Photo explanation: Instagram? Good excuse just to show a beautiful picture of a German volcanic lake (“Booser Maar”). This is the exploding lake that dumped ground layers on a nearby volcano, see the photo I’ve shown in my validation talk summary.


Django under the hood: funding open source, the hard way - Nadia Eghbal

2016-11-04

Tags: django, djangocon

(One of my summaries of a talk at the 2016 django under the hood conference).

You don’t do open source for the money. Django’s projected revenue (i.e. the DSF’s) is 200,000. Instagram’s is 3,200,000,000....

If you want to have funding for open source, money itself is not the problem. There is enough money. The real problem is access to money. As a home-owner, you can get a loan from the bank. As a start-up you can try and find investors. But as open source, there’s no clear way.

You could do things with tipping, bug bounties and so on, but that just grabs a bit of ready cash; it doesn’t provide regular funding.

Many major open source projects were started by employees. Including Django! It is an environment where you can do some experiments and if they go wrong, you still have your regular job.

A further problem: maintenance of existing projects instead of starting something new. Maintenance is hard. Very good that the django software foundation managed to hire someone (Tim) to basically work on django maintenance. The last releases have been the first ones that happened on time :-)

We need to figure out four things:

  • Who needs money. Are there metrics? Can you rank projects when someone comes up to you and wants to give money to “some project”? No.

    Numbers are important, for instance if you want to register as a non-profit institution. A good start is https://libraries.io/, which looks at library usage. “openssl is used by 66% of the internet, so it is important”.

    Projects internally can be measured by project stage. Are they just starting? How many people are looking at it? If it is mature, how many maintainers are actively working on it? And so on.

  • What do they need money for? Here also, you need money for different things per stage. If a project is mature, you might need funding for maintenance or major refactorings. When growing, sprint funding might be more important.

  • How to fund them. How do you pay, and what do you pay for? Do you give directly to people? Or do you fund a project? The latter might sound better, but it does come with lots of extra paperwork.

    There’s a lot of standard legal process. You’ll need that. You should be registered as a non-profit, for instance, for donations to be tax-deductible.

    If you fund projects: centralize efforts. Don’t have 20 different projects in the same open source area, combine them.

    If you want to work on funding individual people: build opportunities for individual grants. There’s a lot of movement in this area right now.

    We still need to figure out how projects can become financially sustainable. There’s no ready conclusion right now.

  • Who should fund them. So: who cares the most about protecting the open source commons?

    Companies are logical. They need the open source commons for cheap resources and protection against competition. But they’re also beholden to commercial company goals.

    Government is the natural steward of public goods. But they are risk-averse. And... many projects are transnational. Does it make sense for the US government to fund an Australian programmer?

    Academia: they have a sustainable model for R&D, but the pace is slow.

Finding the right funder will probably be the last piece of the puzzle, after we’ve figured out the first three points.

So... we’re at the start of exciting times!

  • We got software producers to care about open source (1970s/1980s).
  • We got open source into the hands of interested users (1990s/2000s).
  • We got companies (like instagram) to earn tons of money (“outsized returns”) because of open source (2000s/2010s).
  • The last stage is to get that ton of company money into the hands of the software producers.
'dead' knights lying around

Photo explanation: fighting for funding? While visiting last weekend, we saw some re-enactors recording a movie at a castle in Monreal, Germany. Fun to watch.


Django under the hood: modern javascript - Idan Gazit

2016-11-04

Tags: django, djangocon

(One of my summaries of a talk at the 2016 django under the hood conference).

Note by Reinout beforehand: the company I work for (DutH sponsor Nelen&Schuurmans) has a job opening for a front-end engineer. If you speak Dutch and are interested, give us a ping.

There’s some negative sentiment around Javascript. It might be hard. You might be scared of it. It is much less elegant than python. Etc.

With “modern javascript” he means “civilized javascript”. So how to work with javascript without missing python terribly. There are a lot of reasons why javascript might feel scary. Callback hell. Weird prototyping instead of regular classes. Less syntactic sugar.

But.... javascript is the only runtime that is shipped with every browser!

How did we get here?

Originally, javascript saw limited use: putting snow on your screen at Christmas time and some form validation. Then came google with a super-fast javascript engine, V8. Node (=server side javascript) is basically the V8 engine with some libraries.

So you have the browser world and the node world. Packaging for the browser is done by hand or with bower and the like. Packaging for node is done with “npm”.

For a long time, “ecmascript 5” was the main javascript. A bit like python 2: everybody supported it. In 2015 there finally came a new, improved version: “ES6”, “ecmascript 6”. They’ve now decided to bring out a new version every year! The process is transparent, a bit like python’s PEPs, and it all happens on github.

ES6 can be used safely, all modern browsers support almost everything. It is not bleeding edge.

Fortunately, there’s Babel, a transformer that converts any modern javascript version to ES5. Most of the new features are syntactic sugar anyway.

Javascript can be pretty, too

What is it that makes modern javascript friendlier and easier and cleaner?

  • Modules. No more var math = require('lib/math'). Now you can use import math from 'lib/math'.
  • You now have default arguments to functions.
  • Template strings! Multiline strings!
  • Prototype based inheritance is still weird, but the way you write it in ES6 makes much more sense and is actually readable.
  • Actually, there is a real class keyword that works like you expect it to.
  • A huge problem is javascript’s this. Where python’s self is clear, this can mean various things in various places based on who asks for it. The new ES6 => syntax helps here.
  • Promises. A solution for the callback hell. “then/catch”. If a promise, which would normally be a callback, is resolved, the “then” parts are handled. If the promise fails, the “catch” will be invoked. It makes it much easier to reason about than regular callbacks.

Tooling and culture

So... the core language is now nicer. What also improved: the tooling. Every week the tools you should use change. Best practice from 4 months ago is ridiculed as out-of-date today. It is a Cambrian explosion.

The upside: lots of nice experiments are done. Lots of nice stuff comes out of those experiments.

And... you don’t need to use the latest library just because it is mentioned on hackernews!

And you don’t need to update your npm packages every day. He updates it once per month on a specific “npm update day”. It does also take a day, btw.

A problem: hypermodularity. On the other hand: it takes unix’ “a tool should do only one thing” to its logical extreme.

Everybody applauded when Idan named and thanked Donald Stufft for his work on python’s packaging infrastructure, which works very nicely.

Look at yarn. If you see NPM as “easy_install”, yarn is “pip”.

Also use ESlint to check your code and warn about formatting problems. You can also use “StandardJS”. So set it up in your editor and on your testing machine.

On a scale from “strongly typed” to “loosely typed”, javascript is even further off the spectrum... That breeds a lot of bugs. You could look at static typing: “flow” and “typescript” are two tools you can use.

If you want something like ipython/jupyter: look at https://runkit.com/home

Frameworks

React tastes like Django. And it has the clearest mental model of all the front end frameworks.

Ember tastes like Rails. He likes it less. There is a lot of convention instead of configuration.

Angular: he hasn’t done any work with it. There are two versions, angular and angular2. They are completely different animals. Completely incompatible.

There are lots and lots more.

Orchestration and deployment

Modularity invites a wealth of tooling, but you don’t want to run separate tools all the time to package it all up. So tools like “grunt” and “gulp” came along.

Webpack is rapidly becoming the dominant tool in this area. It not only knows how to deal with javascript, but also css and images. So code and static assets. Asset hashing. Dependency resolution. Loaders and plugins. Bundle splitting. “Tree shaking”: automatically excluding everything from your codebase that you don’t actually use.

The nicest thing about modern javascript (and webpack) is hot module replacement. You can now reload entire parts of your running application without losing state. He loves it. Setting up webpack can be a bit of a pain, but it is well worth the effort.

Closing comments

With modern javascript, you might not feel completely at home yet, but there’s no need to fear it or to feel completely out of place. There’s lots of fun to be had.

railway crossing at currently disused railway

Photo explanation: this is a railway crossing on a currently unused railway (“Eifelquerbahn” near Uersfeld in Germany). But the old tracks are still there and the automatic railway crossing mechanism will undoubtedly still work. Very robust stuff and still in daily use in many places. You can do wonders with the modern internet and with javascript, but what is the lifespan compared to this 50 year old perfectly functioning rail safety equipment?


Django under the hood: custom database backends - Michael Manfre

2016-11-04

Tags: django, djangocon

(One of my summaries of a talk at the 2016 django under the hood conference).

Tip: watch “django in depth” by James Bennett. The database backend is right there at the bottom, at the lowest level.

What does the database backend do? It sits between the Django ORM and the actual database driver. There’s PEP 249, the DB-API 2.0 specification for how python code talks to the actual database driver.

Django abstracts away many of the differences between databases. But not all databases are created equal, so sometimes supporting what django expects is hard. Michael maintains the microsoft sql backend and showed some of the differences.

If you need a custom database backend, you could subclass an existing django database backend. There’s a read-only postgres db backend that has only a few lines of code. But if you create one from scratch, you need to implement about 8 classes.
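
As a sketch of what “subclass an existing backend” can look like (module paths and names are my assumption, not the read-only backend he mentioned): a backend is a python package with a base.py containing a DatabaseWrapper, and the ENGINE setting points at that package.

# myapp/backend/base.py -- reuse the postgres backend wholesale,
# overriding only what you need (here: just the vendor string).
from django.db.backends.postgresql import base  # 'postgresql_psycopg2' on Django < 1.9

class DatabaseWrapper(base.DatabaseWrapper):
    vendor = 'my_postgres_variant'

# settings.py then points the ENGINE at the package containing base.py:
# DATABASES = {'default': {'ENGINE': 'myapp.backend', ...}}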

  • The DatabaseWrapper talks to the PEP249 python database library. Important: the “vendor” string to help django do specific things when it uses your database.

    There are other attributes that tell django how to map simple queries to actual SQL. iexact, less than, stuff like that.

  • CursorWrapper. This one wraps the database cursor. So it translates execute, executemany, fetchone, fetchmany, fetchall, etc., to how the database talks.

  • CursorDebugWrapper: the same as above, only it adds timing information and logging everywhere. Django uses it in DEBUG mode.

  • DatabaseFeatures: a list of features that the database supports. It is mainly used to automatically exclude/include tests from django’s testcase.

  • DatabaseSchemaEditor: used by the migration mechanism to change your database schema. Altering a field is complex.

  • DatabaseCreation. It creates and destroys test databases.

  • DatabaseIntrospection. Used by the inspectdb management command. For his mssql database backend, it is important functionality. It is used relatively often.

  • DatabaseValidation: this hooks the backend into django’s upon-startup validation mechanism.

  • DatabaseOperations is where various bits and pieces that didn’t fit elsewhere went. A big part: date and time helpers.

There’s more than these classes, though.

If you make a query, in the end the .as_sql() method is called on an “sql compiler”. For a custom database backend, you might need to do customization here. Internally, django seems to prefer Postgresql’s sql style.

Database expressions are nice. But they did mean a substantial amount of work in his mssql database backend.

A good video: Josh Smeaton’s “customize your sql”.

You need to look at database-specific ways in which SQL injection could be done. And catch it.

You need custom tests. And you’ll sometimes need to monkeypatch existing tests with @expectedFailure. But the good thing is that there’s a huge amount of existing tests that will be run on your database.

railway tunnel entrance

Photo explanation: what is in the back of that tunnel? (Railway tunnel in Monreal, Germany).


Django under the hood: validation - Loïc Bistuer

2016-11-04

Tags: django, djangocon

(One of my summaries of a talk at the 2016 django under the hood conference).

Loïc has mostly worked on forms and the ORM.

The main concerns with validation are:

  • Enforcement.
  • User experience
  • Performance
  • Convenience

Some items go well together. Enforcement and user experience like each other. You don’t want wrong data. And you want good feedback. Validation helps with that.

But “user experience” and “performance” are harder to combine. Checks do cost time.

Similarly “user experience” and “developer convenience”. Why do you have to check anything on the backend when you already checked it on the front end? Extra work.

Where to validate data?

You can do it in the front end: javascript, html5/browser or in native code like phone apps. The nice thing is that it is fast and provides direct feedback. The drawback is that you have to do the same thing on the backend again, as you cannot trust anything coming in from the front end.

You can also use forms and the django rest framework serializer. Designed for the task, but it is easy to circumvent. Similarly django views.

You could do validation directly on the model. Only problem is that it isn’t run by default. But.... you could call .full_clean() in the model’s .save() method. That makes it harder to circumvent. Though.... bulk creating objects bypasses the save method...
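
A minimal sketch of that pattern, with a hypothetical model:

from django.db import models

class Article(models.Model):  # hypothetical model
    title = models.CharField(max_length=100)

    def save(self, *args, **kwargs):
        self.full_clean()  # run full model validation on every save()
        super().save(*args, **kwargs)

# Note: Article.objects.bulk_create([...]) still skips save() and thus
# skips this validation.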

Some validation can be done in the database. It is designed for the task and impossible to circumvent. And fast. But it will be backend-specific, harder to write, harder to audit and harder to maintain.

Field validation

Field validation is the bread-and-butter of django’s validation. Presence validation, for instance. required=True on a field. Similarly choice validation and range validation (max_length=100, choices=...). Uniqueness validators. Those are all set on the fields.

You can also write specific ones:

from django.core.exceptions import ValidationError
from django.db import models

def validate_even(value):
    if value % 2 != 0:
        raise ValidationError("%s is not an even number" % value)

myfield = models.IntegerField(validators=[validate_even])

A validator is a simple function. You can also write them as a class.

You can customize the error messages raised by validations with a dict. If a ValidationError is raised, the validation mechanism looks for a code attribute on the error. That code is used as a key to look up a replacement message in the dict of error messages.
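
For instance (building on the validate_even example above; the form and field are hypothetical):

from django import forms

class EvenNumberForm(forms.Form):  # hypothetical form
    number = forms.IntegerField(
        validators=[validate_even],
        # A ValidationError raised with code='invalid' gets its message
        # replaced by the one given here:
        error_messages={'invalid': 'Please enter an even number.'},
    )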

ValidationErrors accept several things when instantiated. Just a string, a list (of multiple errors) or a dict, mapping field names to errors.

Example:

raise ValidationError("Invalid value %s" % 42)

Better: add translation:

raise ValidationError(_("Invalid value %s") % 42)

Even better: add a code:

raise ValidationError(_("Invalid value %s") % 42,
                      code='invalid')

Tip: a code attribute also makes it easier to test!

And to help translators, use a dict (“params”) which will be fed to the error string:

raise ValidationError(_("Invalid value %(value)s"),
                      code='invalid',
                      params={'value': 42})

But... If you don’t write a reusable app and just want to display an error string in your own views, just use the simple version.

There are some handy utilities. Form.add_error(), for instance: you pass it a field name and an error. If the field name is None, it will be a whole-form-level error that’s typically displayed at the top.

.add_error() was added in django 1.7 and it replaced 400 lines of documentation on how to do it yourself :-)
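
A sketch of .add_error() in a form’s clean() method (form and fields are hypothetical):

from django import forms

class BookingForm(forms.Form):  # hypothetical form
    start = forms.DateField()
    end = forms.DateField()

    def clean(self):
        cleaned = super().clean()
        if cleaned.get('start') and cleaned.get('end') and cleaned['start'] > cleaned['end']:
            self.add_error('end', 'End date must come after the start date.')
            self.add_error(None, 'Please check the date range.')  # whole-form error
        return cleaned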

The errors end up in an ErrorDict. Historically you have the .as_ul() and .as_text() methods to return the errors for html or plain text. In 1.7, .as_data() and .as_json() were added to make it easier to work with. You can add a .get_json_data() method if you want to provide extra data to .as_json().

Like forms, models have a .full_clean().

If you use ModelForms, the ._post_clean() method is the glue that ties the form to the model. It creates a model instance, but does some special handling around uniqueness.

Closing words

Where to validate? His tips:

  • Validate on your front end when practical.
  • Add mission-critical checks in the database itself.
  • And all the rest of the validation in django itself? Pick the spot where it is handiest for you. Forms, model forms, field validation, etc.

Look at django rest framework, also when thinking about validation. Django rest framework did many things right. For instance by making database models an implementation detail: the externally exposed objects don’t need to correspond to a single specific data model.

geographical layers exposed

Photo explanation: geographical layers exposed in a small excavation in the German volcanic “Eifel” region. By looking at the alternating red/black ground layers they were able to validate that the volcano and a nearby exploding lake (“Booser Maar”) were active at the same time.


Django under the hood: keynote about mental health in tech - Jennifer Akullian

2016-11-04

Tags: django, djangocon

(One of my summaries of a talk at the 2016 django under the hood conference).

The trust equation: Trust = (credibility + reliability + intimacy) / self-orientation. If you want to build trust with someone, keep this equation in mind.

She asked for a show of hands: “how many people have glasses or contact lenses?” Afterwards she asked “how many people have been told to just try harder instead of using glasses?” Laughter, no hands went up. “Well, I have a mental illness and have been told to try harder. Just focus on happy things, just try to be more happy, etc...”

She has bipolar disorder. Found in about 2.6% of the population. It is genetic. It takes on average 7 years for someone to be accurately diagnosed, which is a strangely long time.

She showed a diagram of three neurotransmitters (dopamine, norepinephrine, serotonin) that don’t chemically function well for her and listed some of the medications she takes.

“Would it feel as uncomfortable if I would talk about medicines for a body illness? No. Does it feel uncomfortable to talk about medicines for a mental illness? Yes.” The reason: there is a stigma on it. Everybody knows the problem, but nobody talks about it. Education helps. Showing data helps. Talking about it helps. Getting it in front of people helps.

20% of people have some kind of mental illness. About 2.5% have a serious mental illness. Anxiety is common at 18.1%, depression at 6.9%, for instance.

Now... mental health in tech. There has been research at “Open Sourcing Mental Illness” (OSMI). One of the key results was that only 50% of the people that said they had a problem actually sought help.

What is the impact of mental wellness? You’re more productive. You’re more productive when you’re happy about your victories. You’re more productive if you aren’t anxious.

Mental health in tech also means talking about burnout. There are many possible reasons, but the most basic one is that we’re simply working too hard and not talking about it.

She won’t discuss the workload, but the second part: why don’t we talk about it? If you have a medical problem, you go to a doctor. For mental problems, you need to do introspection. You have to know there’s something to talk about.

She told some about her company where they regularly meet about these subjects, which helps people think about it. They also have coaching, which is another way to talk about it.

Another option: https://forums.osmihelp.org/, the “devpressed” forum where you can discuss it and read about it.

relaxing view

Photo explanation: Relaxing view of a small, sleepy German town (Virneburg), very early in the morning mist. Sweet bliss.


Django under the hood: debugging performance - Aymeric Augustin

2016-11-03

Tags: django, djangocon

(One of my summaries of a talk at the 2016 django under the hood conference).

Performance? Partially it is a question of perception. Up to 0.1s is “reacting instantaneously”. Up to 1s: “not interrupting the user’s flow of thought”. Up to 10 seconds is slow, but the user might keep waiting. More than 10 seconds and they’re off to check facebook.

To optimize something, we have to measure it. For instance page load time.

  • You could use your browser’s development tools to see how quickly a page loads. But you’re measuring that on your fast development laptop.
  • In chrome, you can make your internet connection worse and simulate 3G speed, for instance.
  • Google analytics’ site speed measurement can be used: it measures page load time for 1% of your page views.

Performance timeline

Let’s look at what happens for one single request.

  • DNS lookup. This can take a surprising amount of time.
  • Establish a TCP connection.
  • Finally send over the request to the webserver.
  • You receive the first byte of the response. After a while the last byte comes in.
  • Page processing in the browser itself. Rendering the page and so.
  • onLoad javascript and so.

On average, the actual webserver processing takes only 15% of the time. The good news is that “it is not our backend’s problem”, the bad news is “we have to fix the speed anyway”.

The core: HTTP/1.1 is bad at fetching many resources. There are tricks like server-side concatenation, image sprites, inlining, caching. Client side you can do DNS pre-fetch, TCP pre-connect, keep-alive and pipelining, parallel connections and caching.

The front-end

There are three main stages.

  • Loading. No events, no js yet. Parse HTML and build the DOM. Download and run sync JS.
  • Interactive. DOMContentLoaded event. Download css and images and fonts. Parse css and build CSSOM.
  • Complete. The standard “onLoad” javascript running.

Now to the page rendering. HTML is converted into a DOM. CSS is converted into CSSOM. DOM and CSSOM are combined into a render tree, which is basically the DOM annotated with the CSSOM information. Only then are the fonts loaded and the layout determined and the page painted.

If you have JS, you’ll also start a javascript VM.

There are some surprising dependencies. The CSSOM has to be ready before any regular synchronous javascript is executed, for instance!

  • Rendering a page requires a DOM and CSSOM.
  • Building the DOM blocks on sync JS.
  • Executing JS blocks on the CSSOM.

Browsers luckily optimize heavily to keep page load time down. They parse HTML incrementally. They already paint while waiting for sync JS (once css is available). And they paint while waiting for web fonts (and re-render after the fonts arrive).

Basic strategy:

  • Optimize HTML load time.
  • Optimize CSS load time
    • Unblocks first paint.
    • Allows js execution.

A trick you can use: async javascript. “Script-injected scripts”. This doesn’t block execution.

Another way of async javascript is to write it like <script async src="...">. This executes immediately without even waiting for the CSSOM to be ready.

The new best practice? Put critical javascript as async in your <head>. If there is some non-critical decorative js, put it as async at the bottom of your page.

Server side

The main stages:

  • The request comes in and a Request object is built.
  • We go through the middleware layers (the “before response” part of middleware).
  • URL dispatching and view calling. Perhaps also template rendering (which might take a surprising amount of time).
  • The middleware layers again (the “after response” parts).
  • Django passes the Response back to the browser.

Watch out for middleware: if one of your middlewares does a database query, that query will be done on each and every one of the requests. So be careful of doing expensive things in middleware!

In the view code, you can also do optimizations:

  • select_related(). This means you’ll automatically use a big join instead of many small individual queries. Useful for foreign keys. If you have a base object with 100 other objects pointing at it, grabbing the base object and the 100 others will mean 101 queries. With select_related you’ll have only one big one, which is much quicker.

  • prefetch_related(). Similar to above, only you’ll get two queries instead of the one big one with select_related. The first query grabs the base objects and then python determines which sub-objects need to be fetched in the second query.

    prefetch_related works for every kind of relation. Foreign keys, backwards and forwards. Many2many fields.

    If you need to customize what gets prefetched, you can use a Prefetch object as argument to .prefetch_related().

    New in django 1.10: prefetch_related_objects(). This does the same as .prefetch_related(), only it works on model instances instead of on a queryset. So if you already have objects, you can still use prefetch-related speed-ups. (See the sketch right after this list.)
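
A minimal sketch of the difference, with hypothetical Book/Author/Category models:

from django.db.models import Prefetch, prefetch_related_objects

# One JOIN query instead of one query per book (forward foreign key):
books = Book.objects.select_related('author')

# Two queries: one for the books, one for all their categories:
books = Book.objects.prefetch_related('categories')

# Customizing what gets prefetched:
books = Book.objects.prefetch_related(
    Prefetch('categories', queryset=Category.objects.filter(active=True)))

# New in 1.10: the same speed-up for instances you already have:
some_books = list(Book.objects.all()[:10])
prefetch_related_objects(some_books, 'categories')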

If you want to see if your database queries are fine, enable sql logging in your logging setup. (So send django.db.backends logging to the console).
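
A minimal sketch of such a logging setup in settings.py (note that django.db.backends only emits queries when DEBUG is on):

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {'class': 'logging.StreamHandler'},
    },
    'loggers': {
        'django.db.backends': {
            'handlers': ['console'],
            'level': 'DEBUG',
        },
    },
}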

Some small ORM optimization tips and tricks:

  • Use .only() or .defer() to limit the amount of data you grab out of the database. But.... if you rarely need the data, perhaps you should move some of the data to a separate model that you can link to. (A short sketch of some of these calls follows after this list.)
  • Use .values_list() and .values() if you only need some specific data out of your database and if you don’t really need full-blown model instances. Instantiating model instances is very expensive. “Just” grabbing the actual data out of the database is much faster.
  • Use .aggregate() and .annotate() to do certain calculations (sum, average, count, and so on) in the database instead of in your python code. Especially when you need to manipulate large amounts of data.
  • Rarely used: .iterator(). It iterates over instances, but doesn’t cache results. So you get only one instance per iteration. This is only needed when you need to conserve memory.
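
A hedged sketch of a few of these calls, again with a hypothetical Book model:

from django.db.models import Avg

titles = Book.objects.values_list('title', flat=True)  # plain strings, no model instances
partial = Book.objects.only('title')  # model instances with only 'title' loaded eagerly
average = Book.objects.aggregate(Avg('price'))  # computed in the database, not in python
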
crowded signal box

Photo explanation: we got an explanation of how a signal box works. With ten people it sure was crowded. The signal man didn’t mind us bugging him.

water-gerelateerd Python en Django in het hartje van Utrecht!

Django under the hood: testing - Ana Balica

2016-11-03

Tags: django, djangocon

(One of my summaries of a talk at the 2016 django under the hood conference).

A quick history of django’s testing framework.

  • Ticket #2333 got added to django’s bug tracker before 1.0 was out: we want an integrated test framework (“Rails has it too”). A while later there was a test runner that looked at tests.py and models.py. models.py? Yes, as at that time doctests were still very popular and models were commonly tested with doctests. The rest was for the normal tests in tests.py.

  • Django 1.1 added a built-in test client for basic GET/PUT. Also TransactionTestCase was added: this one rolled back database transactions at the end of the tests. Better performance.

  • 1.2 added a new class-based test runner. You could now also terminate the entire test run upon the first error (“failfast”).

  • 1.3 splits the old test client into an actual client and a RequestFactory. Well, the client is a subclass of RequestFactory, something Ana doesn’t like and would like to see refactored during the sprints.

    Doctests turned out not to be an ideal combination of tests and documentation. Testing was harder and the documentation not clear. So doctests were discouraged.

  • In 1.4, more TestCases were added. SimpleTestCase for tests without databases, for instance.

  • 1.5. Python 3 support lands in django. A full testing tutorial is added to the documentation. Several assert tests are added.

  • 1.6. “patch” is added to the supported methods of the built-in client. Test discovery is improved. Doctest discovery was removed.

  • 1.7 uses the new unittest. Earlier versions bundled a backport of the unittest2 library; now that old python versions have been dropped, the standard unittest library can be used, as it includes all unittest2 functionality.

  • 1.8. TestCase is changed again. Fixture loading is sped up.

  • 1.9. --parallel is added: running multiple tests in parallel, if the tests support it. If you use an older django version, you might use nose’s multiprocessing plugin.

  • 1.10. Nicest feature: you can tag your tests to group them and exclude/run them as a group:

    @tag('slow')
    def test_something(self):
        ...
    

Running tests

Now on to what happens under the hood when you run the tests.

  • Set up the main test runner.
  • setup_test_environment(). This sets up a locmem email backend. Translations are deactivated.
  • build_suite(): collect all the tests. The heavy lifting is done by python’s unittest framework. Django adds some functionality, like tags.
  • setup_databases()
  • run_suite()
  • teardown_databases()
  • teardown_test_environment()
  • suite_result(). Return the tests results.

Test classes

  • SimpleTestCase. Very fast, it doesn’t interact with the database. It does have access to the test client.
  • TransactionTestCase. Slow. It hits the database and does (necessary) transaction management to isolate the tests.
  • TestCase. Faster than TransactionTestCase.
  • LiveServerTestCase: launches a live http server in a separate thread. Slow.
  • StaticLiveServerTestCase. Special version of the above.

(Note TODO for myself: investigate them further, I can probably speed up my tests by using a separate testcase!)

Quality

Django provides test functionality, but.... how do we write high quality tests? There are some tools to help us.

  • Use FactoryBoy. It replaces fixtures by easily creating model objects (see the sketch after this list). It uses Faker to provide nice random data (person names, company names, email addresses).

  • Hypothesis: property based testing for python. This will run tests multiple times with random data to try and find corner cases. There is a django add-on for it.

  • Coverage testing. It is currently 76% for django. High coverage doesn’t mean high quality. She thinks it is a deceptive metric.

    (Personal note: I dislike this reasoning. Everybody (=several of my colleagues) tends to say “high coverage doesn’t mean a thing”. Coverage metrics are discouraged when you say it like this. But isn’t it obvious that a low coverage rate indicates bad quality tests? As most of the code isn’t tested? Yes, you cannot say tests are high quality if the coverage is high, but you can say the tests aren’t good enough if the coverage is low. So why is code coverage bad as a metric?)

  • Read the django tutorial on testing! This is a good explanation. Django tries to improve your code quality by making it as easy as possible for you to write tests. https://docs.djangoproject.com/en/stable/intro/tutorial05/
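
A minimal factory_boy sketch, using a hypothetical Robot model (which also shows up in the speed tips below):

import factory
from myapp.models import Robot  # hypothetical app and model

class RobotFactory(factory.django.DjangoModelFactory):
    class Meta:
        model = Robot

    name = factory.Faker('name')  # random-but-plausible data via Faker

# RobotFactory() creates and saves a Robot with a random name;
# RobotFactory.build() only instantiates one in memory.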

Improve test speed

Test speed is important. The quicker, the more often you run it.

  • Use MD5PasswordHasher in testing. Django does it.

  • Consider in-memory sqlite3.

  • Have more SimpleTestCase.

  • Use setUpTestData() instead of setUp(). (A sketch of this and the password-hasher tip follows after this list.)

  • Be vigilant of what gets created in setUp()

  • Don’t save model objects if not needed. Is an in-memory model enough? So instead of Robot.objects.create(), use Robot().

  • Isolate the fast unit tests from the rest of the tests. For instance by using SimpleTestCase. You can run those tests separate from your functional tests.

    For instance, you can run the fast tests very often during coding. When everything’s OK, you run all tests together.
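
Two of these tips as a sketch (the Robot model is hypothetical, and the settings/test layout is an assumption):

# In your test settings: weak but fast password hashing.
PASSWORD_HASHERS = ['django.contrib.auth.hashers.MD5PasswordHasher']

# tests.py:
from django.test import TestCase
from myapp.models import Robot  # hypothetical

class RobotTests(TestCase):
    @classmethod
    def setUpTestData(cls):
        # Runs once for the whole class instead of before every test;
        # per-test modifications are rolled back via transactions.
        cls.robot = Robot.objects.create(name='Marvin')

    def test_name(self):
        self.assertEqual(self.robot.name, 'Marvin')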

testing his strength

Photo explanation: my son testing his strength by throwing a switch in an old signal box

water-gerelateerd Python en Django in het hartje van Utrecht!

Django under the hood: Channels - Andrew Godwin

2016-11-03

Tags: django, djangocon

(One of my summaries of a talk at the 2016 django under the hood conference).

Django channels was started by Andrew Godwin, best known for his work on South and django migrations.

Channels might seem like magic, but it is not. He’ll start by describing the problem and then the actual django channels work.

The problem: the web is changing

A lot of the web is becoming async. Web sockets and so. But it is not only websockets: long-polling, webrtc, MQTT, server-sent events. This doesn’t match django’s regular webpage behaviour very well.

  • Python is synchronous. Only the latest python 3 releases have async built in.
  • But even then, Django is still synchronous at the core.

Synchronous code is easier to write. Async is much harder. Only do it if you really have to. Synchronous code is much easier to reason about. Single-process async is not good enough, so you’ll have multiple processes, threads and perhaps even machines.

You’ll need a proven design pattern that isn’t too hard to reason about. This is no place to do something weird and new. Multiple people must be able to maintain it. And if you use it, you’ll need many people who are able to program for it.

Loose coupling

Loose coupling is a good thing. The solution should not be too tied to websockets: there is more. And it should also not be too tied to Django: the ecosystem can be bigger that way.

You’ll need well-defined, minimal interfaces. It should be easy to swap parts.

A solution could be a message bus. A single place where you point all the pieces to. Django can talk to it, an HTTP server can talk to it. A socket server can talk to it. There are trade-offs and it is not perfect, but it does work well and is well-researched.

Now, how do you talk to it? How do you send messages through it? That is defined in a new standard called ASGI. The five things you can do:

  • Nonblocking send.
  • Blocking receive.
  • Add to group.
  • Discard from group.
  • Send to group.

Basically, you send JSON-compatible dictionary-based messages onto named channels.
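
Very roughly, and as an assumption on my part (the channel layer API was still settling at the time; older drafts of the spec call receive() receive_many()), those five operations look like this against the local-memory backend:

from asgi_ipc import IPCChannelLayer

channel_layer = IPCChannelLayer(prefix='demo')

channel_layer.send('example.channel', {'text': 'hello'})       # non-blocking send
channel, message = channel_layer.receive(['example.channel'])  # blocking receive
channel_layer.group_add('chatroom', 'websocket.send!abc123')   # add to group
channel_layer.send_group('chatroom', {'text': 'broadcast'})    # send to group
channel_layer.group_discard('chatroom', 'websocket.send!abc123')  # discard from group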

Concrete ideas

A web socket has “connect”, “accept/reject”, “receive”/”send” and “disconnect”. That’s the abstract concept of the websocket protocol. How do you map that to channels?

Everything is translated into a message that is sent to a channel (on the message bus). So a “connect” event is sent as a message to the “websocket.connect” channel. And so on. You have a websocket server that accepts websockets on one end and sends out messages to the message bus on the other end.

Any (django) worker on the bus can take the message and handle it. When sending it back, it has to be routed back to the specific socket server, as that one has the connection to the client. There is support for it (“reply channel” with a receiver indicator included after the channel name).

The system is not perfect: there are trade-offs, each with their own drawbacks. If something goes wrong, what do you do, for instance? Do you use at-most-once delivery? Then the message might be dropped. If you guarantee delivery, it might be delivered multiple times. Which one do you want? You’ll have to make those trade-offs deliberately. Some might sound fine, but they might cost you a lot of speed.

What has been done already: HTTP and websockets. Rough drafts for IRC, email and slack. Please don’t ever do this: minecraft and mainframe terminals.

A hard thing is ordering. Messages come in on the channel: what if handling depends on the order they came in? You can configure a specific channel to maintain ordering. An order key will be added to received messages and there’ll be explicit connection acceptance: this costs you some speed, though.

There are five packages that make up django channels.

  • Channels. The django integration.
  • asgi-redis. Redis backend.
  • asgi-ipc. Local memory backend.
  • asgiref. Shared code and libs.
  • Daphne. http/websocket server. (“I took Twisted and Autobahn and hot-glued it together”). You can use Daphne only for websockets and use gunicorn or another wsgi runner for the regular http requests.

Workers in channels: consumers based on views

The consumers of messages look very much like django views:

@channel_session
def chat_receive(message):
    # ... (handling elided in the talk) ...
    example_django_object = Message.objects.create(
        name=...,  # elided in the talk
        content=message["text"],
    )

Routing is based on URLs:

routing = [
    route("websocket.receive",
          consumers.chat_receive,
          path=r"^/chat/socket/$"),
    # include(...) can nest further routing lists
]

How do you maintain state? You could use the database backend. You can also use sessions. In django channels, sessions hang off reply channels, not off cookies. It uses the same session backends. It can also access long-term cookie sessions. So you can set a username into the session and then use that throughout the entire websocket session!

A controversial thing: there is no middleware. Because of the ways consumers are written, it is virtually impossible to capture incoming and outgoing messages like regular middleware does. Decorators replace most of the functionality you’d normally need middleware for.

Important: the regular django views are still there! They can work via channels, too, if needed.

What is the future

He hopes it will allow generalised async communication. There’s also the binary MQTT “internet of things” sensor data. It is not only html. Do you want micro services? Separate things per CPU? Send certain stuff to a specific server? Synchronous python 3 stuff? Perhaps even mix python 2 and 3...

Part of this channels work is still young, especially by django standards. He wants more implementations to make sure it works well. Daphne is OK, for instance, but Django is not in the business of writing web servers.

He hopes it will make django much faster.

He also explicitly wants some more co-maintainers. Also documentation, bug fixing, bug reporting.

Channels is now an official django project and close to a 1.0 release.

switch and signal wires

Photo explanation: Switch and signal wires being channeled in the correct direction in the cellar of an old signal post of the “ZLSM” touristic railway in the Netherlands.


DSA ssh keys also deprecated in OSX Sierra

2016-11-03

Tags: python, djangocon, apple

I’ve been using an ssh key for a long time. Back in the days when RSA keys were mostly commercial/proprietary and when DSA keys were the recommended method.

When ubuntu 16.04 came out, I suddenly couldn’t use my key anymore as ubuntu (or perhaps rather its updated ssh server) didn’t accept DSA keys anymore. They’re apparently not safe enough anymore.

I worked around it by adding a setting on the server, telling it to accept my DSA key.
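
Presumably that setting was something like this OpenSSH 7.0+ option in /etc/ssh/sshd_config (exact setup may differ):

# Re-enable DSA public keys on the server:
PubkeyAcceptedKeyTypes +ssh-dss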

When the latest Apple OS update (Sierra) came out, I suddenly couldn’t log in anywhere. Also git pull to github (I’m using ssh) stopped working. I found the cause by passing -vvvv to ssh:

debug1: Next authentication method: publickey
debug1: Trying private key: /Users/reinout/.ssh/id_rsa
debug3: no such identity: /Users/reinout/.ssh/id_rsa: No such file or directory
debug1: Trying private key: /Users/reinout/.ssh/id_ecdsa
debug3: no such identity: /Users/reinout/.ssh/id_ecdsa: No such file or directory
debug1: Trying private key: /Users/reinout/.ssh/id_ed25519
debug3: no such identity: /Users/reinout/.ssh/id_ed25519: No such file or directory
debug2: we did not send a packet, disable method
debug3: authmethod_lookup password
debug3: remaining preferred: ,password
debug3: authmethod_is_enabled password
debug1: Next authentication method: password

Oh.... Sierra’s ssh client also doesn’t like DSA anymore. Ok... time to create a new key :-)
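
For instance (ed25519 being one of the key types the client looks for, judging by the id_ed25519 line in the log above):

$ ssh-keygen -t ed25519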

... which I did about 6 weeks ago. With a nice, long passphrase. And 2 weeks ago I had to create a new one as I couldn’t for the life of me remember the exact passphrase. I’ve rarely felt so stupid :-) A passphrase can have many almost-right variants:

I'll take the train to django under the hood
I'll take the train to django under the hood.
I'll take the train to Django under the Hood
I'll take the train to Django under the Hood.
I will take the train to django under the hood
I will take the train to django under the hood.
I will take the train to Django under the Hood
I will take the train to Django under the Hood.
I'm taking the train to Django under the Hood.
I'm taking the train to Django under the Hood
I'm taking the train to Django Under The Hood.
I'll go by train to Django Under The Hood.
etc...
etc...

Well, everything works again :-)

 