Djangocon: good schema design and why it matters - Andrew Godwin¶

(One of the summaries of a talk at the 2014 djangocon.eu).

Andrew Godwin is a django core developer and he’s the author of South and its successor, the database migrations inside django itself.

Schemas can be explicit or implicit. Postgresql, mysql and so have explicit schemas, they’re regular . Redis, ZODB, couchDB: implicit, you basically “just” put data in (which in probably does have some kind of structure, of course).

A problem with implicit schemas is that you can get silent failures: a weight might be "85kg" instead of 74, so a string instead of an int.

Schemas inform storage, which might help with performance. And they enforce structure upon the data. The drawback? It enforces structure upon the data. It is less flexible.

When you add columns to your table or if you add an index, postgresql often behaves much better than mysql. It is faster and does much less locking. In case you use oracle or mssql, learn their strengths.

Well. Schemas. They will need to change. You can put your schema into a vcs, but there’s no guarantee that your data will survive the modification.

Django has migrations build-in now. But migrations aren’t enough: you can’t automate away a social problem. You still need to talk to your colleagues when you remove columns :-) And you often cannot merge migrations when they come from different git feature branches, at least it is not as easy as merging code: you’ll need to coordinate.

What about using a schemaless database? Just chuck some dict into a mongodb? Watch out, your data has to be a match for the database. You can easily end up with multiple copies of objects as there are no foreign keys.

You could work with a mix of schemas and schemaless. You could add a specific schemaless column (something like a jsonfield) for all those little things you want to store on an item that are just a pain to add to your schema. You won’t query on them, so they don’t need an index. Works well! Don’t make it too big, as you’ll hurt performance.

Databases can have different characteristics:

Read-heavy (a blog for instance),
Write-heavy load (every time you do something, you hit the DB, for instance for storing logging).
Large size.

How does your database for your app behave? What do you have to optimize on?

In-memory cache might be good if you have a small, read-heavy site.
The further down you come from “many reads”, the fewer indexes you should use. More indexes make writes slower.
Write heavy? Look at append formats (“append to a file” is very fast, it just isn’t optimized for reading).
Very very large size? Look at offline storage. (Amazon glacier, for instance).

Keep in mind: your load changes over time. Scaling isn’t a flat modifier: you’ll use your database in a different way in the start-up development/demo stage than in full production.

Some generic tips:

Write heavy? Fewer indexes. You can block too many. If you do it completely wrong, you can block the whole database.
Read heavy? Denormalize.
Very important: keep large data away from read/write heavy data. And keep fast data away from slow data. Django has 1:1 foreign keys, great for moving a user’s profile fields (that aren’t used a lot) into a separate table, keeping the often-foreign-key-linked user model nice and small.
Blob stores or filesystems are databases, too. A filesystem is effectively a key(“filename”)/value store for blobs!
Re-evaluate as you grow. Different things matter at different sizes.
Adding NULL columns is great. Always prefer this if nothing else. And use postgresql. He once saw that adding a column on a big database took 50 seconds on mysql and just a few miliseconds in postgresql.
As much as he likes postgresql: you’ll need more than one kind of database. Postgres + filesystem, for instance. Or mysql + redis.

But don’t use too many, you’ll be swamped. And don’t run after the latest newest shiny database. And don’t trust benchmarks.
Indexes aren’t free: you pay the price at write/restore time.
Very useful: relational databases are flexible. They can do a lot more than JOINing normalized tables. You can do tricks like storing comma-separated values in a big string. You can denormalize. You could use special fields like postgresql’s arrayfield.

http://reinout.vanrees.org/images/2014/django1.jpg

French diesel train in 2007

Reinout van Rees

My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.

Weblog feeds

Most of my website content is in my weblog. You can keep up to date by subscribing to the automatic feeds (for instance with Google reader):