Reinout van Rees’ weblog

My take on the “12 factor app”


Tags: django, nelenschuurmans

The “12 factor app” is often quoted as a standard shopping list if you want to get your deployment right. At least in the Python web world it is, it seems to me.

I’m currently looking at the way we deploy stuff at our company (Nelen & Schuurmans). Partially by hand, lots via fabric, increasing use of ansible. And many infrastructure parts, like the main web proxy config, essentially by hand, aided by scripts.

Not everything in the 12 factor app list is needed for us, but it helps me think about what we need to keep and what we need to improve.

  • One codebase tracked in revision control, many deploys. We use git/github well. We also have multiple deploys, this works OK.

  • Explicitly declare and isolate dependencies. Python packages and buildout. Pinning. Works fine. A few projects are less tidy, though, with git branch checkouts instead of tidy packages.

    Take care with javascript/css dependencies. The recent trend towards grunt, bower and so on, with a nice requirements file: looks good.

    On the whole, the way we compose the actual project works fine.

  • Store config in the environment. Keep things like DATABASES out of your settings file. We do this wrong now. Environment settings or, perhaps better, configuration in an /etc/xyz/sitename/ directory, is the way to go.

    Important: it suggests that “deploying the site” is a separate step from “getting the environment onto the server”, right? You haven’t won anything if the database setting is outside of your settings file but inside an ansible config file within the very same git project!
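    A minimal sketch of what that could look like in a Django settings.py, with the database settings read from the environment (the SITE_DB_* variable names and the defaults are made up for this example):

```python
# Sketch: read database settings from environment variables instead of
# hardcoding them. The SITE_DB_* names are invented for this example.
import os

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': os.environ.get('SITE_DB_NAME', 'sitename'),
        'USER': os.environ.get('SITE_DB_USER', 'sitename'),
        'PASSWORD': os.environ.get('SITE_DB_PASSWORD', ''),
        'HOST': os.environ.get('SITE_DB_HOST', 'localhost'),
    }
}
```

    The deploy step then only has to get the right environment variables (or an /etc/xyz/sitename/ file that provides them) onto the server; the code itself stays identical everywhere.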

  • Treat backing services as attached resources. A database or a directory/share is something that is provided to you. So unless a site’s goal is to manage shares, it ideally shouldn’t do any managing or setting-up of shares.

    Such a backing service must then of course be easy to get at and it must be something you can depend on. Once creating a database is out of your hands, it is easy to get frustrated if something is wrong or if the process is slow. This is where “reliability” and “ease of use” and “transparency” start to get really important.

    What about tools like memcached? Normally it is run on the webserver itself and often it is installed by fabric/ansible which installs the site. Should this be an external backing service or is it OK to have it run locally? Same with a local celery install, for instance. TODO: thinkwork :-)

  • Strictly separate build and run stages. This we don’t normally do. bin/buildout on the server will happily compile python packages when needed. And bin/grunt will collect its packages. Ideally, the download and compile step happens somewhere else.

    Bundling the eggs and/or wheels could help. You can generate a debian package out of a buildout, too. Or simply zip up the whole shebang. Depending on globally installed packages could remove the need for custom compilation.

    Two of our projects are using automatic integration deployments when the tests all run OK in Jenkins... Perhaps we can use the output of this to skip the build stage on the production server?

  • Execute the app as one or more stateless processes. The way we run things, we like to have both a running gunicorn and an nginx configuration and perhaps a celery daemon... That’s more than one stateless process.

    The strict separation advocated by 12factor is probably really useful for their (=heroku’s) server setup with many independent customers. It is less useful/necessary for us.

  • Export services via port binding. In the end, what goes out is mostly the nginx port 80. So this is OK.

    On the other hand... we do some hardcoded IP address backchannel idiocy somewhere. And in another project much stuff is harder-coupled than advocated by 12factor. Some of it is by necessity, some can be avoided.

  • Scale out via the process model. In a way we do, with multiple gunicorn processes. In a way we don’t, as the only real service that runs on more than one machine is our “lizard5” website. And lizard5 is awfully mis-configured for that usecase (non-shared folders, faulty caching). I still need to fix that.

  • Maximize robustness with fast startup and graceful shutdown. Nginx, gunicorn: we’re ok. Restarting a server is normally not a big deal.

    Restarting the gunicorns works through buildout-installed cronjobs with an @reboot time. Turning that into a proper system-level service would be neater.
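    For instance, an upstart job could replace the @reboot cronjob; a sketch, with the site name and paths invented:

```
# /etc/init/sitename-gunicorn.conf (hypothetical upstart job)
description "gunicorn for sitename"

start on runlevel [2345]
stop on runlevel [016]
respawn

exec /srv/sitename/bin/gunicorn sitename.wsgi
```

    With respawn you also get automatic restarts for free when gunicorn crashes, which the @reboot cronjob doesn’t give you.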

  • Keep development, staging, and production as similar as possible. Everyone uses ubuntu, that’s quite a good start. Custom installed versions and PPAs and manual sudo easy_install -U .... tend to mess it up, though.

    Ansible is sometimes used to manage development, staging and production alike. Take care to not do too much here: setting up a database locally for development can be fine, but it won’t match the production environment, where the database is something outside of your direct control that you should treat as a “backing service”. And personal development environments are prone to personal preferences. Time will need to tell what the sweet spot is.

    Docker of course is hard to beat regarding development/production parity. Configuration is a problem here, as you don’t want to have so much parity that you’re developing against the production database! Time will tell.

  • Treat logs as event streams. This is something we don’t do yet. Everything is logged to /var/log and /srv/sitename/var/log. Do we need an infrastructure-wide logging server that everyone can send its logs to? Europython talks seem to suggest that it is a very handy addition to sentry!

  • Run admin/management tasks as one-off processes. We’re mostly OK here with django’s management commands and other scripts we have lying around in our buildouts.

Ansible thoughts (plus questions for which I need your input)


Tags: django, nelenschuurmans

Now... how do we structure our servers? How do we put our software on them? How do we configure them?

I’ve done my share of server maintenance and devops and what have you. On linux, that is. Maintaining my own (web)server since 1996 or so. Done a lot with buildout (which means grabbing everything together for a site that has to do with python). Automated much of our installation with Fabric so that

$ bin/fab update production

was more or less the standard way to update a site. So I know my way around reasonably and I normally keep everything pretty neat and tidy.

Now Ansible has come along and several of our sites are managed with that tool. It feels like a much more elaborate version of Fabric. “Elaborate” is meant very positively here. And I want to dive into it and start using it too.

Fabric is great for executing commands via ssh on a remote server (including copying files, modifying files etc). Ansible also executes commands via ssh, but comes with a complete system for storing your configuration, handling variables, dividing up the workload into libraries and also for sharing those libraries (called “roles” in ansible-speak).

Problem description

I’ve only experimented with it a little bit. Apt-get update/upgrade on the two local laptops at home, that sort of stuff. Today I wanted to get backups working: just a basic rsync cronjob on the two laptops, rsyncing their stuff to my “” server, somewhere remote in England. Connection via a passwordless SSH connection.

Looks simple. Just add the cronjob and a private ssh key to the laptops, add the public ssh key to the server’s backup user and you’re ready to go. But it did not feel right: these are really two kinds of configuration that I want to keep separate.
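Roughly what such a playbook could look like in ansible, for illustration (host group names, paths and key file names are all invented here):

```yaml
# Hypothetical playbook: hosts, paths and keys are made up.
- hosts: laptops
  tasks:
    - name: install the private key for the backup connection
      copy: src=keys/backup_rsa dest=/root/.ssh/backup_rsa mode=0600

    - name: nightly rsync to the backup server
      cron: name="nightly backup" hour=3 minute=0
            job="rsync -az -e 'ssh -i /root/.ssh/backup_rsa' /home/ backup@backupserver:laptop-backups/"

- hosts: backupserver
  tasks:
    - name: allow the laptops' key on the backup user
      authorized_key: user=backup key="{{ lookup('file', 'keys/backup_rsa.pub') }}"
```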

The server is one that I manage together with my brother Maurits. The laptops at home are mine alone to manage. Different “organizations” if you want. Should they be mixed?

Ansible seems to like it best when it is, in the end, one big configuration. Of course you split it up a bit in roles, but when you want to install a new website, you want to add a database to the database server, add a new site to the external proxy configuration, install the django site on one of the webserver machines, perhaps make sure memcached is on that webserver machine, etc: for one site you need changes in multiple places on multiple machines.

This again, to me, looks like two organizations are mixed up. On the one hand the people that have to make sure the main external proxy keeps working and that the production database doesn’t get abused with 25 different test/staging/development sites and that nobody kills off four websites because he upgrades some dependency (gdal or numpy .deb package or so) for the sole other site on that machine that needs it.

On the other hand, the developers of the site want to make sure the server(s) are up to date and ready for the site. And that means adding a database, installing some packages, adding the url to the proxy, etc. Exactly the things that shouldn’t need to be modified all that often.

So... is this a lack of trust? No. Complete trust, actually, that something is going to get royally messed up in case everyone gets root access to all of our servers. I mean, puppet (used by the sysadmins) got de-installed from servers because someone thought it was unused. And packages got upgraded (“let’s add ubuntugis-unstable as a PPA to our 12.04 LTS so that we can use the latest GDAL that we need”), after which the four sites on that server died due to failed dependencies once the server rebooted.

So... Ansible can handle most of our servers and infrastructure and sites. But there are two kinds of configuration, I think:

  • Server/infrastructure configuration. Which LTS to install on which server? Front-end main web proxy config. Database server configuration. Which packages to install on which server. Which services like memcached or redis to run on the machines.
  • Site configuration. Special cases. Database to be installed. Nginx configuration. Necessary packages. Perhaps memcached or redis.

There’s quite some overlap between the two. But they develop at quite a different pace. Tricky, as they partially need each other’s data. The main proxy needs to know which site is on which server, for instance.

Current situation

We now have a base set of ansible roles we can reuse for multiple sites to set up a server and for setting up sites the way we want.

There are three or four sites where abovementioned base roles are used in combination with custom roles and custom config to actually configure and install the site on the server.

It is a handy way to install a site, whether on a local vagrant/virtualbox or on the production server. Same command, same setup.

You need full root access on the server, though, to get it all working. And you need “create database” permission on the database server. If there are other sites on the same server (as is the custom for us, now), you’ve suddenly got multiple sites ansible-bombarding the same poor server.

Doing it all in one configuration doesn’t sound like a perfect fit, either. Too many parallel projects, too many individual changes. For instance, someone locked himself out of a server after running ansible on it, as someone else had added a role to fix the sudoers file. The one running ansible wasn’t allowed to become root anymore after his ansible script updated the sudoers file, as that file was now tweaked to the needs of another one of the projects...

My impression

Ansible is hailed as the cure for all our installing woes. And for finally breaking free of the sysadmin mold (“I’m not allowed to do anything on this machine”). But I’m pretty sure that if everyone can just update/fix/improve everything with the same big all-encompassing ansible config, it’ll just be a nice automatic way to shoot ourselves in the foot.

We’re, in practice, using it for installing individual projects. Sometimes even to manage two or three servers for one individual project. Servers specifically for that project: fine.

We’re not using it yet to manage all aspects of all our servers. Using our current setup in the all-encompassing way is not very safe-looking.

The problem is that we do talk about ansible as if it is one big configuration where everything is stored and with which we developers can now automatically manage everything ourselves. Our mental model doesn’t match up with reality.

My questions

My questions are a reality check, really:

  • We have lots of different kinds of sites (django 1.4, 1.5, 1.6, flask, static, socket). Each wants to use ansible to install itself on the server (staging/production) and also to set up a virtualbox for development. Is ansible suited for per-site setup work?
  • Ansible seems the right tool to manage a large collection of servers. Making sure the right packages are installed (“a couple of 12.04 ones with memcached pre-installed”). Making sure the right services run. This should be relatively stable, so allowing the current problems (someone mucks about on the server by hand) should not be automated, right? (So: not everyone can update the server).
  • The first point lets developers update the server just like they would the local development virtual machine. From within a site’s own custom ansible config. The second point lets sysadmins set up the right infrastructure from within one big ansible config. Those two don’t match, right? (At least not together at the same time).

Some options


  • Docker means effectively a per-site packaged virtual machine. So you can muck about in it as much as you want, basically. All the packages you want, all the nginx config you want, all the memcached/redis/whatever you want. Root access. Copy-able to your local dev machine. You can set up your docker instance with a custom ansible config.

    The docker instance has to be installed on a server somewhere, but (apart from some port forwarding and perhaps mounting a directory), there’s nothing much that needs configuring. Could be a nice combination with an infrastructure-wide ansible config, right?

    Docker seems to remove quite some of the belongs-to-site-or-infrastructure problems by allowing, effectively, a full server to be in user space, right?
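    A sketch of what such a per-site Dockerfile could contain (base image, package names and paths are assumptions):

```
# Hypothetical Dockerfile for a django site.
FROM ubuntu:12.04
RUN apt-get update && apt-get install -y python-pip memcached
ADD . /srv/sitename
RUN pip install -r /srv/sitename/requirements.txt
EXPOSE 8000
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "sitename.wsgi"]
```

    The infrastructure-wide config then only needs to know “run this image and forward port 8000”, not what’s inside it.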

  • The setup of a server could also be handled, for a big part, with debian packages. A site’s installer (probably ansible) can easily check whether a package (“our-company-standard-LTS-with-memcached.deb”, for instance) is installed OK, and it can then safely assume that the environment is OK.

  • You could extract this-should-be-done-globally tasks from a site’s ansible config. And write a program to send those over as a pull request to the main infrastructure-wide ansible config. Things like “I want my port 8004 to be available via the main proxy as” or “I need a database on the staging database server”.

Anyway... any input appreciated! I want to hone (or fix) my thinking.

Mail tips to your colleagues


Tags: python, nelenschuurmans

tl;dr summary: mail a colleague if you see something that just might be useful to them.

A couple of weeks ago I made a django app way faster. The most important improvement was to use python’s built-in datetime.datetime.strptime() to parse dates instead of the iso8601 library. 20 times faster for our use case.
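The gist of that change, as a sketch: strptime() with one fixed format string does exactly the parsing you need and nothing more (the actual timestamp format in our app is an assumption here):

```python
from datetime import datetime

# One fixed format string: no format guessing, no timezone juggling.
timestamp = "2014-05-20 14:30:00"
parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
print(parsed.year)  # → 2014
```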

Last week, a colleague mailed me a link with the email subject “you’re not the only one that found date parsing to be slow”. He just saw the article and mailed it to me. He didn’t even read it in detail.

To me, it looked quite relevant and I did read it in detail. Wow! A datetime parsing library that promises to be even faster? Trying it out meant changing 5 lines or so. Wow! Another 6x speed increase compared to strptime()!

Good news for our customers. Good news for the django app. All because of a simple quick mail.

So... think about your colleagues when you spot something that might be especially useful! It might just be the link that makes their day.

No need to use get_context_data, use {{ view.some_method }}


Tags: django

There’s a super nice super elegant productivity trick for Django class based views. Something that is not very well known, as I discovered at last week’s djangocon. So... time to share it!

The problem? Getting information into your template’s context. What most people do is override get_context_data(), just like the example from the django 1.5 docs:

def get_context_data(self, **kwargs):
    # Call the base implementation first to get a context
    context = super(PublisherDetailView, self).get_context_data(**kwargs)
    # Add in a QuerySet of all the books
    context['book_list'] = Book.objects.all()
    return context

In the template you’d use it like this:

{% for book in book_list %}
  <li>{{ book }}</li>
{% endfor %}

The solution? A new feature in Django 1.5: Django’s generic class based views now automatically include a view variable in the context. This variable points at your view object. So... you basically never need to modify get_context_data() anymore! Example:

from django.views.generic import TemplateView

class SomethingView(TemplateView):
    template_name = "something.html"
    title = "My beautiful list of books"

    def books(self):
        return Book.objects.all()

    # Note: no get_context_data() here!

The template then uses the automatic view variable to grab the list of books and the title:

<h1>{{ view.title }}</h1>
<ul>
  {% for book in view.books %}
    <li>{{ book }}</li>
  {% endfor %}
</ul>

Isn’t that wonderful? It fits Django’s philosophy perfectly: the templates should be dumb, the processing should happen inside the view. Now it is dead easy to “just add a method” to calculate something, the method will automatically be available in the template as view.your_quick_helper_method.

No more tedious duplication like this:

from django.views.generic import TemplateView

class SomethingView(TemplateView):

    def books(self):
        return Book.objects.all()

    def get_context_data(self, **kwargs):
        context = super(SomethingView, self).get_context_data(**kwargs)
        context['books'] = self.books()
        # ^^^ bad duplication.
        return context

I mean, context['books'] = self.books(), that’s just plain unnecessary duplication. But... no need to do it anymore! Just call the method or grab the attribute directly.

Some notes:

  • It works for Django’s generic class based views. So if you subclass TemplateView or UpdateView or so, you’re fine.

    Behind the scenes, it is the ContextMixin that provides the basic get_context_data() method that injects the view variable into the context.

  • Django’s normal view restrictions apply. Attributes are fine. Methods with just self as a parameter are fine.

    Methods that need an extra parameter (def books(self, category)) won’t work, as django’s template language doesn’t allow you to call methods like that.

    Similarly, attributes and methods prefixed with an underscore (“private”) aren’t available.

  • Hm, I think this is an essential (and very handy!) part of Django’s view mechanism. I added the original ticket. But... I notice now that it is hardly mentioned in the docs.

    Should this be better documented? I’m thinking about changing the adding extra context paragraph to “just stick a method/attribute on the view” instead of “mess with get_context_data()”. Is this a good idea?

Djangocon keynote: the programmer’s body - Daniele Procida


Tags: djangocon, django

(One of the summaries of a talk at the 2014 djangocon.)

Daniele Procida talks about diversity in our industry.

Everybody in the python/django community seems to share the values and see the advantages of diversity. So he doesn’t need to sell them to us now.

What were the earliest computers? Women. The power of early computers was measured in “girl years”: the work of how many computing women could it do? Women were highly prized for their manual computing work because they were paid half the salary of male colleagues.

Programmers are now building the world. Anyone in the world will live in the world we build. The systems that govern us. The systems that govern our relations. The way we are educated. The way we are formed. Programmers are quite important.

So a valid question is “who are these programmers?” But: does the question matter? Doesn’t it matter more what we do rather than what we are? Liberalism says that what we do is much more important.

Liberalism abstracts from “particulars”. And it makes assumptions and assertions of equality. It actually says that people should be equal. It asserts individual autonomy, responsibility and merit.

In case you don’t like those answers, you need to be quite lucky to belong to the right group.

The greek philosophers made a clear distinction between the mind and the body:

  • The mind is alive, essential, free, human.
  • The body is inert, accidental, determined, animal.

This opens up to the idea of “universal humanity”, abstracted from the particular. The differences do not count. It results in the idea of universal human rights. He showed a picture of Conchita Wurst, the Austrian man dressed up as a woman who won the Eurovision Song Contest last week. “It is time to worry about more important things than things like our sexuality”.

The ideology of the IT industry shares liberalism’s privileging of the mind over the body. It is redefining humanity. We’re defined by our online personalities and relationships. Who we are as a body doesn’t matter.

Facebook/Mark Zuckerberg’s vision of relations has re-defined our real-world relations. They recently bought VR maker Oculus: they also want to change our perception of the real world.

Technology changes the world. But it usually changed the world in response to a real need. Now we have ideas in search of problems: so our ideas will now try to change our world.

So... the question “who are those programmers?” isn’t so strange. Mostly they’re quite young. Often men. Often white. Well-off. Educated. Western.

At least the ones who are driving this development often are.

What happens, normally, when people of one specific kind are all the doctors, all the lawmakers, all the politicians, all the programmers? What happens then?

Liberalism abstracts from bodies. Whose bodies? The bodies of those who don’t need to think about their bodies. Well, bodies without problems. Male bodies. Bodies that don’t become pregnant. Bodies that aren’t old. Bodies that don’t have problems getting food. Bodies that are in a good position already.

Liberalism’s blind spot is that of the privileged elites. Just like in the UX design talk with the car/cyclists analogy: when only car drivers design the road, the road will never be a good place for cyclists.

So liberalism’s philosophy isn’t giving our IT industry the philosophy we need. The IT industry has many issues with bodies. We often forget and damage and mistreat our bodies. Human relationships lead to controversies. Your stand on gay marriage can cost you your job as IT CEO, for instance.

Working in IT could be the ideal job for a physically disabled person. But he hasn’t seen a wheelchair at djangocon yet. He sees more wheelchair users engaged in sporting activities in the park!

The bodies in our industry are very different. But there is a stunning lack of diversity. There are many disparities in confidence, pay, influence, participation. There are high-profile controversies regarding sexuality, sex, sexism. And there are dark sides to it like harassment.

It is not safe for the world we are building if all the programmers are privileged white men.

Liberalism inadvertently undermines diversity. Liberalism undermines some of its own key aspirations like justice, fairness, equality, meritocracy because it cannot find a place for the body in its philosophy.

Question: what animal would you be if you were an animal?

Answer: we are already an animal. So we need to look to our animal, our body. Human reason, moral responsibility, knowledge: they are all embodied.

  • Liberalism fails to recognise the embodied nature of human reason.
  • This failure makes disputes angrier, because of the way it leads us to dispute.
  • Anger is exclusive and detrimental to diversity. So watch out when getting angry on twitter.

He showed a poem by Mary Oliver:

You do not have to be good.
You do not have to walk on your knees
For a hundred miles through the desert, repenting.
You only have to let the soft animal of your body
love what it loves.
Tell me about despair, yours, and I will tell you mine.
Meanwhile the world goes on.
Meanwhile the sun and the clear pebbles of the rain
are moving across the landscapes,
over the prairies and the deep trees,
the mountains and the rivers.
Meanwhile the wild geese, high in the clean blue air,
are heading home again.
Whoever you are, no matter how lonely,
the world offers itself to your imagination,
calls to you like the wild geese, harsh and exciting –
over and over announcing your place
in the family of things.

He read us another poem, too. Great! “To be of use” by Marge Piercy.

Some comments:

  • Moral knowledge is practical, lived.
  • Learning to be considerate is a question of how to be, not of what to know.
  • People are generally kind, friendly, gentle, generous, but not always very considerate.
  • How is this kind of knowledge transmitted?

A conclusion:

  • Our python/django community already strongly embraces moral values.
  • We are in a position to assert values and act on them.
  • We influence the members of our community.
  • We influence other communities. They are watching us.

We are building the world.

French diesel train in Kruth, 2007

Djangocon: purist unit tests are a waste of time? - Harry Percival


Tags: djangocon, django

(One of the summaries of a talk at the 2014 djangocon.)

Harry Percival, despite what you might guess from the talk’s title, is very testing-oriented. Just look at his website :-)

He demoed a quick addition to a small TODO list app. Using outside-in TDD (test driven development).

Outside-in? Just start at the outside (functional test that hits the TODO page). You try to click a not-yet-existing link. You add the link with a dummy url. You adjust the test. And so on.

And slowly you dive into more detail. And slowly you get to write more detailed tests. And slowly you descend from the functional test level to the unit test level.

A funny term for this is “programming with wishful thinking”. “I wished I had a list of TODO items”. So then you first mock it and then build it for real. “I wished I had a nice ‘List’ object”. Repeat.

He did the same thing with more unittests and mocking. But mocks are no fun, and they’re dirty. So he started cleaning up his code to get rid of the mocks. In the end, the code was cleaner and clearer than with his first effort.

Use unittests to force yourself to think about your code. This way your code will be better.

Main “vallée de la Doller” French railway station Burnhaupt in 2007.

Djangocon: pytest helps you write better django apps - Andreas Pelme


Tags: djangocon, django

(One of the summaries of a talk at the 2014 djangocon.)

Andreas Pelme‘s slides are at .

Django already provides some nice tools to make testing easier. The testing experience can become even better, for instance by using pytest. Pytest is quite popular. The ones using it don’t want to use anything else. (He showed a couple of tweets of well-known django developers to prove it).

Pytest is extensible. Andreas is the maintainer of the pytest-django plugin.

The main advantage of pytest is that it allows you to write pythonic tests, without boilerplate.

  • No more self.assertEquals(a, b) but simply assert a == b.

    Advantage: the error message when it doesn’t match is much clearer than the regular test exception.

  • You don’t need to subclass from TestCase anymore. Just a function is enough. Django’s TestCase would normally give you self.client. The function will look something like def test_something(client): .... The django extension for pytest recognizes “client” and passes in the correct object automatically.

    That auto-recognized “client” attribute? Pytest calls that a “fixture”. No, that’s something else than django’s fixtures.

    Test fixtures are very handy for test dependencies.
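    The no-boilerplate difference, in a minimal sketch (add() is a made-up function under test):

```python
# Made-up function under test.
def add(a, b):
    return a + b

# unittest style would be self.assertEquals(add(2, 3), 5) inside a
# TestCase subclass. Pytest style: a plain function with a plain assert.
def test_add():
    assert add(2, 3) == 5

test_add()  # pytest would collect and run this automatically
```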

You run the tests with py.test. For this, you need to have your project on the python path. So you need to do pip install -e ., so you need to have a setup.py.

You need to tell pytest which tests to run. For this, add something into pytest.ini:
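With pytest-django the main thing to configure is a pointer to your settings module; a minimal pytest.ini could look like this (project name invented):

```ini
[pytest]
DJANGO_SETTINGS_MODULE = yourproject.settings
```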


Note: regular django testcases still work! No problem. So you can get started right away without immediately needing to re-write all your existing tests.

Sometimes you don’t need to run all the tests. You can pass a directory or a testfilename to py.test, only those tests will be run, then.

Test organization. You can group tests into directories. By default, it looks for tests named test_*.py. You can change this.

Suggestion: split your tests into separate directories per kind of test. A directory for unit tests. One for integration tests. One for browser tests. It is up to you, though. It depends on the work you’re doing.

Speed: you can tell pytest to reuse the database from the last test run. Of course this is not a good idea when you’re making many model changes. It might be good if you often run a single test (couple of milliseconds) and if it takes a few seconds to set up your full database.

Look at pytest-xdist. It can run your tests in multiple processes. On a 4-processor machine, this cuts your testing time in half! There’s also an option for running tests on multiple separate machines.

Before you can use the django database in a test, you first have to mark the test with a decorator. If not, you’re not allowed to hit the database. This is useful for quick unittests. You can tell pytest to run only the tests that don’t require a database: these are often much faster to run.

He then demoed how to create a fixture. His example was one that returned a selenium Firefox webdriver to the test. You can configure a “scope”. With scope='session' you re-use the same object for all the tests. In some cases, this is handy. In other cases you need to limit the scope to an individual test run.

The test would look something like this:

def test_hello(live_server, firefox_driver):
    firefox_driver.get(live_server.url)

You simply request the features you need in your test. Simple!

Note: when creating database objects, use factory_boy instead of doing it by hand.

Want to know more? Look at Holger Krekel’s talk on youtube.

My son admiring a French locomotive in 2007. Vallée de la Doller.

Djangocon: The whys and hows of using django formsets - Shena Priscilla Makini


Tags: djangocon, django

(One of the summaries of a talk at the 2014 djangocon.)

Many people don’t use or know about django formsets.

She googled for articles and found “formsets: the worst thing ever” and “formset gotcha: or how to lose your sanity”. Ouch.

What are form sets for? When you want to use the same form multiple times. For instance for a data grid. A table with one form per row, for instance. Django formsets help you put several of those forms on one page.

The example she showed was of changing your “mailman” settings for a couple of mailinglists on one page. It is used in postorius, a new web UI (django) for mailman.

It works something like this:

SomeFormset = formsets.formset_factory(SomeForm, extra=5)
formset = SomeFormset()

And in the template it is just:

{{ formset }}

This gives it to you as a table. You could also iterate over it and get the individual forms:

{% for form in formset %}
{% endfor %}

(Personal note: oh blast, I didn’t know about this feature! I know just the spot where I desperately need to use it. That one app where I do some manual field generation and mess about with the IDs and a custom template. So... thanks for the talk!)

French “mallet” loco in 2007. Magnificent machine. Vallée de la Doller.

Djangocon: open data on the sea shore - Richard Moch


Tags: djangocon

(One of the summaries of a talk at the 2014 djangocon.)

Richard Moch shows us open sea map, open street map for the sea! Lighthouses, shipping lanes, etc.

They’re also trying to build up a sea depth map consisting of open data. See .

There’s more: . Yes, open weather data for current weather and forecasts. There is an open meteo foundation that tries to get it off the ground.

Based on this open weather data, you can build apps, for instance for showing good locations for doing paragliding.

Open weather data? Then you need to measure it. They are working on an open source autonomous weather station.

Modern French diesel unit in Kruth in 2007

Djangocon: good schema design and why it matters - Andrew Godwin


Tags: djangocon, django

(One of the summaries of a talk at the 2014 djangocon.)

Andrew Godwin is a django core developer and he’s the author of South and its successor, the database migrations inside django itself.

Schemas can be explicit or implicit. Postgresql, mysql and so on have explicit schemas: they’re regular relational databases. Redis, ZODB, couchDB: implicit, you basically “just” put data in (which probably does have some kind of structure, of course).

A problem with implicit schemas is that you can get silent failures: a weight might be "85kg" instead of 85, so a string instead of an int.
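A tiny illustration of such a silent failure, with made-up records: nothing stops one record from storing a string, and you only notice when you inspect or compute.

```python
# Without an explicit schema, nothing stops one record from storing
# the weight as a string.  The problem only surfaces later, when you
# check (or try to compute with) the data.  Made-up records.
records = [{"weight": 85}, {"weight": 74}, {"weight": "85kg"}]

bad = [r for r in records if not isinstance(r["weight"], int)]
assert bad == [{"weight": "85kg"}]  # found only because we checked
```

With an explicit schema, the database would have rejected the string at write time instead.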

Schemas inform storage, which might help with performance. And they enforce structure upon the data. The drawback? It enforces structure upon the data. It is less flexible.

When you add columns to your table or if you add an index, postgresql often behaves much better than mysql. It is faster and does much less locking. In case you use oracle or mssql, learn their strengths.

Well. Schemas. They will need to change. You can put your schema into a vcs, but there’s no guarantee that your data will survive the modification.

Django has migrations built-in now. But migrations aren’t enough: you can’t automate away a social problem. You still need to talk to your colleagues when you remove columns :-) And you often cannot merge migrations when they come from different git feature branches, at least it is not as easy as merging code: you’ll need to coordinate.

What about using a schemaless database? Just chuck some dict into a mongodb? Watch out, your data has to be a match for the database. You can easily end up with multiple copies of objects as there are no foreign keys.

You could work with a mix of schemas and schemaless. You could add a specific schemaless column (something like a jsonfield) for all those little things you want to store on an item that are just a pain to add to your schema. You won’t query on them, so they don’t need an index. Works well! Don’t make it too big, as you’ll hurt performance.
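The “schemaless column” trick boils down to: keep fixed, queryable columns for the structured data and serialize a dict of rarely-used extras into one text column. A minimal plain-Python sketch with the stdlib `json` module (in Django you would use something like a text or JSON field for the extras):

```python
import json

# Regular columns for the data you query and index on; one JSON
# text "column" for the little extras you only ever read back whole.
# The row and its keys are made up for illustration.
row = {
    "title": "Schema design talk",  # regular, indexed column
    "extras": json.dumps({"speaker_shirt": "blue", "room": "A1"}),
}

extras = json.loads(row["extras"])
assert extras["room"] == "A1"
```

Because the extras are opaque to the database, you can add new keys without a migration, at the cost of not being able to query or index them.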

Databases can have different characteristics:

  • Read-heavy (a blog for instance),
  • Write-heavy load (every time you do something, you hit the DB, for instance for storing logging).
  • Large size.

How does your database for your app behave? What do you have to optimize on?

  • In-memory cache might be good if you have a small, read-heavy site.
  • The further down you come from “many reads”, the fewer indexes you should use. More indexes make writes slower.
  • Write heavy? Look at append formats (“append to a file” is very fast, it just isn’t optimized for reading).
  • Very very large size? Look at offline storage. (Amazon glacier, for instance).

Keep in mind: your load changes over time. Scaling isn’t a flat modifier: you’ll use your database in a different way in the start-up development/demo stage than in full production.

Some generic tips:

  • Write heavy? Use fewer indexes: too many of them block writes. If you do it completely wrong, you can block the whole database.

  • Read heavy? Denormalize.

  • Very important: keep large data away from read/write heavy data. And keep fast data away from slow data. Django has 1:1 foreign keys, great for moving a user’s profile fields (that aren’t used a lot) into a separate table, keeping the often-foreign-key-linked user model nice and small.

  • Blob stores or filesystems are databases, too. A filesystem is effectively a key(“filename”)/value store for blobs!

  • Re-evaluate as you grow. Different things matter at different sizes.

  • Adding NULL columns is great: prefer it over other kinds of schema changes whenever possible. And use postgresql. He once saw that adding a column on a big database took 50 seconds on mysql and just a few milliseconds in postgresql.

  • As much as he likes postgresql: you’ll need more than one kind of database. Postgres + filesystem, for instance. Or mysql + redis.

    But don’t use too many, you’ll be swamped. And don’t run after the latest newest shiny database. And don’t trust benchmarks.

  • Indexes aren’t free: you pay the price at write/restore time.

  • Very useful: relational databases are flexible. They can do a lot more than JOINing normalized tables. You can do tricks like storing comma-separated values in a big string. You can denormalize. You could use special fields like postgresql’s arrayfield.

French diesel train in 2007
