(One of the summaries of a talk at the 2014 django under the hood conference).
Andrew Godwin wrote South, the number one migrations framework for django. It has been superseded by django’s new built-in migrations, also written mostly by Andrew, hurray!
The original idea was to have a schema backend and hooks in the django ORM. The actual migration code would live outside of django in south2. In the end, everything is now in django. The original distinction between “schema backend stuff” and “the actual migrations” is still there in the code, however.
The schema backend is relatively simple and straightforward; the migration part is hard and hairy. The migration part contains: operations, loader/graph, executor, autodetector, optimiser, state. He’ll talk about some of them here.
What about the old syncdb? It is a one-shot thing: you add tables and then you add the foreign keys. When migrating, you have dependencies. You cannot add foreign keys to tables you haven’t added yet. There is automatic dependency-detecting code now, but that was added at the very last moment, in 1.7 beta 2…
Basic dependencies are the obvious stuff. Some examples:
Create a model before adding a foreign key to it. Most databases get fussy if you try to add the foreign key too early.
Create the model before creating the fields. Sounds simple, but you need to have these basics in place first in the dependency graph before you can get on to the hard cases.
Now on to the more creative dependencies.
For many-to-many fields you need both target models first before you add the actual M2M model that points at the targets.
Multi table inheritance? Create the MTI parent before the child.
“Unique together” constraints need to be done after adding the fields. Same for “index together”.
“Order with respect to” is a rarely used feature that adds an extra field with an ordering based on a foreign key field. He started despairing when he discovered this feature.
Proxy models. Weird things need to happen when you actually turn proxies into real models and want to do that in a migration. It is a one-line change for a developer, but it makes for nightmares in the migration code. “You have to create the model with the same name before you can delete the model with the same name”. Yes, that’s not a typo.
Swappable models! Please take them away! Swappable models? For instance the User model that you can replace with a different custom model. Suddenly a migration that you already applied might need to point at a different model. Rolling back the migrations is not an option, as that leads to data loss. It works fine if you do it at the start of the project.
He used a different mindset when developing django’s migrations as opposed to how he developed South. South depended on people reading the documentation. Which they often don’t do. So they could shoot themselves in the foot quite well. Instead, django’s migrations are much more bulletproof, so there is much less need for reading the documentation in detail.
There’s a main loop in the migrations code that tries to find dependencies, shifts operations to satisfy the dependency, checks if everything is fine, and loops again and again until it is right.
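To make that loop a bit more concrete, here is a minimal sketch in plain Python. It is not django's actual code: the operation names and the order_operations() helper are made up for illustration, and the real autodetector works on operation objects rather than strings.

```python
def order_operations(operations, depends_on):
    """Shift operations around until each one comes after its dependencies.

    operations: list of hypothetical operation names.
    depends_on: dict mapping an operation to the operations it needs first.
    """
    ops = list(operations)
    seen = set()
    while True:
        # The process is deterministic, so a repeated ordering means we
        # are going round in circles: the dependencies contain a cycle.
        state = tuple(ops)
        if state in seen:
            raise ValueError("circular dependency between operations")
        seen.add(state)

        moved = False
        for index, op in enumerate(ops):
            later = [dep for dep in depends_on.get(op, ()) if dep in ops[index + 1:]]
            if later:
                # Move the operation to just after its last unmet dependency.
                target = max(ops.index(dep) for dep in later)
                ops.insert(target, ops.pop(index))
                moved = True
                break
        if not moved:
            # Everything is fine: no operation precedes one of its dependencies.
            return ops


operations = ["add_fk_book_author", "create_book", "create_author"]
depends_on = {"add_fk_book_author": {"create_book", "create_author"}}
print(order_operations(operations, depends_on))
```

In this example the foreign key operation ends up after both model creations.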
The way it works is by chopping all operations into tiny dependencies. Every individual field that has to be created is turned into a tiny dependency step. After the list of steps is sorted (via the dependency-resolving loop) into the correct list of steps, an optimiser goes through the list and optimises it. If a model gets created and deleted, nothing needs to be done, for instance.
This kind of reducing could be dangerous. So there’s another loop that checks which reductions/simplifications are possible and whether there are conflicts. It is better to have no optimisation than to have a wrong optimisation.
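A rough sketch of the "created and deleted again, so do nothing" reduction, including a conflict check. This is an illustration only, not django's optimizer: the tuple format (kind, model, referenced models) and the reduce_operations() name are assumptions made for the example.

```python
def reduce_operations(operations):
    """Drop create/delete pairs for the same model when that is clearly safe.

    Each operation is a hypothetical tuple (kind, model, references), where
    references is the set of other model names the operation points at.
    """
    ops = list(operations)
    changed = True
    while changed:
        changed = False
        for start, (kind, model, _refs) in enumerate(ops):
            if kind != "create_model":
                continue
            for end in range(start + 1, len(ops)):
                end_kind, end_model, _ = ops[end]
                if end_kind != "delete_model" or end_model != model:
                    continue
                between = ops[start + 1:end]
                # Conflict: another model points at this one in between.
                # Better no optimisation than a wrong one, so skip it.
                if any(m != model and model in refs for _k, m, refs in between):
                    break
                # Safe: drop the pair and everything that only touched the model.
                kept = [op for op in between if op[1] != model]
                ops = ops[:start] + kept + ops[end + 1:]
                changed = True
                break
            if changed:
                break  # the list changed, so start scanning from the top
    return ops


operations = [
    ("create_model", "Author", set()),
    ("add_field", "Author", set()),
    ("create_model", "Book", set()),
    ("delete_model", "Author", set()),
]
print(reduce_operations(operations))
```

Only the creation of Book survives here. If an operation on Book had referenced Author in between, the list would have been returned untouched.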
Reduction is applied after various stages: after the automatic dependency detection, after applying the manual dependencies, and after squashing.
Squashing: it makes your history a bit shorter. It squashes migrations into a new starting point. This is especially handy when you’re a third party app developer.
The final part of the puzzle is the graph. It builds a directed graph of all basic migrations in memory. It needs to read all the models on disk for that. It also looks in the database. There’s a table in there that marks which migrations (or rather: nodes in the migration graph) have been applied.
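As an illustration of how such a graph can be used, here is a small sketch that walks the dependencies of a target migration and leaves out what the database says is already applied. The function names and the example migrations are hypothetical, and the (app label, migration name) tuples as node keys are an assumption for this example; django's own graph code is more involved.

```python
def full_plan(dependencies, target):
    """Return target and everything it depends on, dependencies first."""
    plan = []
    in_progress = set()

    def visit(node):
        if node in plan:
            return
        if node in in_progress:
            raise ValueError(f"circular dependency at {node}")
        in_progress.add(node)
        # Sorted only to make the resulting order predictable.
        for parent in sorted(dependencies.get(node, ())):
            visit(parent)
        in_progress.discard(node)
        plan.append(node)

    visit(target)
    return plan


def unapplied_plan(dependencies, target, applied):
    """Filter the full plan down to nodes not yet marked as applied."""
    return [node for node in full_plan(dependencies, target) if node not in applied]


# Hypothetical graph: nodes are (app label, migration name) pairs.
dependencies = {
    ("books", "0002_add_author_fk"): {("books", "0001_initial"), ("authors", "0001_initial")},
    ("books", "0001_initial"): set(),
    ("authors", "0001_initial"): set(),
}
applied = {("authors", "0001_initial")}
print(unapplied_plan(dependencies, ("books", "0002_add_author_fk"), applied))
```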
A squashed migration lists the graph nodes that it replaces. A squash can only be used if all the replaced nodes have the same state. If they are all unapplied, the squash is applied instead of them. If they’ve all been applied, the squash is considered applied as well.
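The all-or-nothing rule for squashes fits in a few lines. Again a sketch with made-up names (squash_status() and the "partial" label are not django's), just to show the three possible outcomes.

```python
def squash_status(replaces, applied):
    """Decide how a squashed migration relates to the nodes it replaces.

    replaces: non-empty list of hypothetical node names the squash replaces.
    applied: set of node names recorded as applied in the database.
    """
    done = [node in applied for node in replaces]
    if all(done):
        return "applied"    # treat the squash itself as already applied
    if not any(done):
        return "unapplied"  # run the squash instead of the old nodes
    return "partial"        # mixed state: the squash cannot be used yet


replaces = ["0001_initial", "0002_add_field", "0003_add_index"]
print(squash_status(replaces, set()))
print(squash_status(replaces, set(replaces)))
print(squash_status(replaces, {"0001_initial"}))
```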
There’s room for improvement!
The autodetector is slow.
The optimizer is not great.
Graph state building is inefficient. Very inefficient. It might take 30 seconds. Building the state itself isn’t that slow, but it simply happens thousands of times.
It is mostly a case of un-optimized code. Big, pretty dumb, loops. So: everyone’s invited to help out, for instance at the sprint.
If you want to look at the code, here are some pointers:
django/db/migrations/autodetector.py, start at _detect_changes()
django/db/migrations/optimizer.py, start at reduce()
django/db/migrations/graph.py
django/db/migrations/loader.py
His slides are at https://speakerdeck.com/andrewgodwin/migrations-under-the-hood
My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.