I hate your database - Andrew Godwin

Tags: django, djangocon

Andrew is most well-known because of south. So he’s familiar with lots of databases. And also with database warts. And also familiar with the ways in which people abuse databases.This talk is a collection of separate points to make us aware of the things that can be wrong with a database.

Kinds of databases

There are many different kinds of databases. Key/value, document, relational, graph, object/hierarchical, spatial, timeseries, search. Examples of the three most common ones:

  • Relational: postgresql, mysql, sqlite.

  • Document: mongodb, couchdb.

  • Key/value: redis, cassandra, riak.

Some theory first before we look at the problems.

  • ACID. Atomicity, consistency, isolation, durability. This is the main theory/rule behind most/many databases.

  • CAP theory. Consistency, availability, partition tolerance. You can only have two of those, according to the theory. So if a database claims to have all three, be suspicious.

Problems (and some good parts)

Relational databases.

  • Columns, rows. relations, normalization. Watch out with the way you write your queries! It is easy to be sub-optimal. Looping and doing a separate .save() for every individual item, for instance. 53 queries instead of 1.

  • Mysql has many warts. There are many things that don’t work at the same time. You either have one feature or another. MyISAM allows full-table locking, but no transactions. The query optimizer is poor. It is owned by oracle (which might be a bad thing). But for some tasks it is fast.

  • SQLite has little integrity checking. No concurrent access. It is impossible to do some table alterations. No concurrent access because it tries very hard to be a small, portable database. That’s also the good thing because it is very portable and easy to install :-)

  • Postgres has a slow default configuration. And it can be a little harder to learn. And it almost has too many features. But… it is incredibly reliable! It has all the SQL features. Transactions for everything. Great geospatial support. Great database. (Andrew really seemed enthousiastic about it.)

Document databases have no fixed schema. Low barrier to entry. Closer to python datatypes.

  • A generic problem is that they’re immature (but improving). You don’t need migrations in your database anymore, true. But now you need to handle the migration in your code. Schemas have their use: they are there for your benefit and your protection. So with a document database you have more risk.

  • There are no transactions.

  • No integrity checking.

  • Lots of small things like mongo not freeing disk space unless you drop and re-create the entire database. And losing data upon a crash (that’s better now).

Key/value databases. The good thing is that they’re fast, but the bad thing is that you can only query by key (which is why they’re fast).

  • They allow horizontal scaling. Just thrown in another box. The drawback is that the consistency isn’t perfect. You might not get the exact correct result.

  • You can only query by key. This means that they’re mostly only useful for things like caching and timestamped stuff.

  • There is often support for map-reduce queries which helps a bit with queries.

Spatial databases are strange beasts.

  • Knowledge of projections is useful. Every single country has their own projection, giving you conversion headaches.

  • Use spatial indexes. They really speed op some problems.

  • Generally, they’re an add-on to an existing database. For instance postgis for postgresql.

Filesystem. Yes, that’s a database. It is a hierarchical key/value store! Supports very large files. Stable. Well-known.

Graph databases. Very efficient for neighbour queries. For anything else? Not useful.

RRD, round robin databases. They deliberately lose old data! Useful, actually, for logging or statistics. You won’t need the old data anyway. It prevents you from running out of disk space.

Final thoughts

  • It is unlikely all your data all fits in one paradigm. You probably already store big files on the filesystem instead of in your database.

  • Just buying a bigger server goes a long way. Just a big big box instead of horizontal scaling. It might keep you from lots of unnecessary work and headaches.

  • If it sounds too good to be true, it probably is. Just be careful.

Feel free to try out databases. Different types of databases, too.

Photo & Video Sharing by SmugMug
 
vanrees.org logo

Reinout van Rees

My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.

Weblog feeds

Most of my website content is in my weblog. You can keep up to date by subscribing to the automatic feeds (for instance with Google reader):