PostgreSQL when it is not your job - Christophe Pettus

Tags: django, djangocon

(Note beforehand: I didn’t get everything. You really also have to look at his slides for the details. Thanks Christophe for posting the slides right afterwards!)

Update: this article was mentioned on ycombinator. There’s a bunch of extra info and more tips there. I want to re-iterate that I could not just type over his slides; there are a couple of details (“more than 32GB memory in your server”) that I left out as I’ve only got 10 fingers :-). Anyway, on to the actual summary! End of update.

Just open a postgresql configuration file. Your first impression will be “this looks like the dashboard of an ancient Boeing 747” and “we’re all going to die”. 500 options and knobs you can twist. But… it can also be very easy if you know what to do.

It actually is pretty hard to seriously misconfigure postgres. Almost all performance problems are application problems. Here are the main things you need to look at:

  • Logging. Be generous with logging. You mostly won’t notice the performance impact, and it helps a lot when tracking down performance problems. Look at his slides for the piece of config you can just copy/paste (a rough sketch of these settings follows after this list).

  • Memory. Only four settings really matter.

    • shared_buffers. Below 2GB of system memory: set it to 20% of total memory; below 32GB: 25% of total memory.

    • work_mem. Start low, at 32-64MB. Look for “temporary file” lines in the logs, then set it to 2-3x the largest temp file you see. This setting can give a huge speed boost (if set properly).

    • maintenance_work_mem. Set it to 10% of system memory.

    • effective_cache_size is only a hint to postgres. Just set it to the amount of filesystem cache available.

  • Checkpoints. Postgres doesn’t write everything directly to disk as that kills performance. There are some checkpoint settings you can set to tune when postgres writes.

    • wal_buffers = 16MB

    • checkpoint_completion_target = 0.9

    • checkpoint_timeout = 10m (up to 30m; a longer timeout means a longer recovery time at startup after a crash)

  • Planner settings.

    • (Just look at his slides. I cannot type that fast).
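
To make this concrete, here’s a sketch of what those logging, memory and checkpoint settings could look like. This is my own reconstruction, not his literal slide content: the numbers assume a dedicated 16GB machine, and ALTER SYSTEM needs postgres 9.4 or later (on older versions you’d put the same values in postgresql.conf directly).

    -- Logging: generous, so performance problems show up in the logs.
    ALTER SYSTEM SET log_min_duration_statement = '250ms';  -- log slow queries
    ALTER SYSTEM SET log_checkpoints = on;
    ALTER SYSTEM SET log_lock_waits = on;
    ALTER SYSTEM SET log_temp_files = 0;  -- log every temp file, for tuning work_mem

    -- Memory, following the rules of thumb above.
    ALTER SYSTEM SET shared_buffers = '4GB';           -- 25% of total memory
    ALTER SYSTEM SET work_mem = '64MB';                -- start low, raise based on the temp file logging
    ALTER SYSTEM SET maintenance_work_mem = '1600MB';  -- 10% of system memory
    ALTER SYSTEM SET effective_cache_size = '8GB';     -- roughly the filesystem cache

    -- Checkpoints.
    ALTER SYSTEM SET wal_buffers = '16MB';
    ALTER SYSTEM SET checkpoint_completion_target = 0.9;
    ALTER SYSTEM SET checkpoint_timeout = '10min';

    SELECT pg_reload_conf();  -- shared_buffers and wal_buffers only take effect after a restart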

After tweaking those settings, you never have to touch your configuration again. There are other tips for performance:

  • Disable the Linux OOM killer (by setting vm.overcommit_memory=2, for instance).

  • Do not run anything besides postgresql on the host.

Stupid DB tricks you should not do:

  • Sessions in the DB. Use redis or something similar for them.

  • Constantly-updated counters.

  • Celery task queues in the database.

  • Big files in the database. Put them on the filesystem where they belong.

  • Very long-running transactions.

  • Using .save() all the time for creating many objects. Just use postgres’s COPY (see the example below).
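
For that last point, a minimal sketch of a COPY-based bulk load (the books table and the csv file are invented for the example):

    -- One COPY statement instead of thousands of single-row INSERTs
    -- (which is what repeated .save() calls boil down to):
    COPY books (title, author, price)
    FROM '/tmp/books.csv'
    WITH (FORMAT csv, HEADER true);

Server-side COPY FROM a file needs superuser rights; from psql there’s the client-side \copy variant, and psycopg2 offers copy_expert() for the same job.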

Some quick tricks you could look at on your application’s side:

  • Don’t use gigantic IN clauses. Django generates them a lot. JOINs are fine, but postgres handles huge IN lists poorly.

  • Use postgres’s full text search instead of LIKE '%this%' (see the example after this list).

  • Don’t randomly create indexes. Maintaining an index takes time, too. Look at pg_stat_user_indexes and pg_stat_user_tables to see whether indexes are needed. Get a database expert to tweak your indexes instead of slapping on index=True in your Django model fields.
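
A minimal sketch of that full text search point (table and column invented for the example):

    -- Instead of: SELECT title FROM books WHERE title LIKE '%postgres%';
    SELECT title
    FROM books
    WHERE to_tsvector('english', title) @@ to_tsquery('english', 'postgres');

    -- A GIN index on the same expression keeps it fast:
    CREATE INDEX books_title_fts ON books
    USING gin (to_tsvector('english', title));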
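
And the index bookkeeping he mentions can start with queries like these (my own phrasing, but the statistics views are standard postgres):

    -- Indexes that are (almost) never scanned are candidates for dropping:
    SELECT schemaname, relname, indexrelname, idx_scan
    FROM pg_stat_user_indexes
    ORDER BY idx_scan;

    -- Tables with lots of sequential scans might be missing an index:
    SELECT relname, seq_scan, idx_scan
    FROM pg_stat_user_tables
    ORDER BY seq_scan DESC;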

On the database server, you can switch off the auto-vacuum process and schedule VACUUM for specific low-load times. VACUUM also does an ANALYZE. You should also run an ANALYZE after loading lots of data; it helps postgres optimize your queries.
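
In SQL that is simply (the table name is just an example):

    VACUUM ANALYZE;  -- e.g. from a nightly cron job during low load
    ANALYZE;         -- statistics only, e.g. right after a bulk load
    ANALYZE books;   -- or for a single freshly-loaded table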

Note that once in a very long while (once a year or so), postgres does a VACUUM FREEZE. Real housekeeping. To prevent it from bringing your app to its knees at a random moment, do the VACUUM FREEZE every few months yourself.
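
Running it yourself is a single statement; and if you want to see the automatic freeze coming, you can keep an eye on the transaction ID age (a standard catalog query, not from his talk):

    VACUUM FREEZE;  -- during a planned quiet period, every few months

    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;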

Always monitor postgres, at least disk space and system load. Memory and IO stats are very handy too. Use 1-minute intervals.

Backups? pg_dump works fine. But… there’s also streaming replication. Easy to set up, and it maintains an exact copy of the database on a different host. (But really, really make sure it is on a different machine!) Streaming replication doesn’t safeguard at all against accidental errors, as it will happily replicate your accidental ‘drop table’ too.
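
As an impression of how little the primary needs for streaming replication (parameter names as in recent postgres releases, they were slightly different at the time of the talk; the hostname is made up):

    ALTER SYSTEM SET wal_level = 'replica';
    ALTER SYSTEM SET max_wal_senders = 3;
    -- ...plus a 'replication' line in pg_hba.conf for the standby's user.
    -- The standby itself is then typically cloned with pg_basebackup:
    --   pg_basebackup -h primary.example.com -D /var/lib/postgresql/data -R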

And… replicas can be used as read-only copies! And you can run a pg_dump against such a read-only copy.

WAL archiving is great for safety. It does take some more time to set up, but it helps you sleep at night.
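
The core of it is just two settings (the archive directory is an invented example; in practice you’d ship the segments to another machine instead of cp-ing them locally):

    ALTER SYSTEM SET archive_mode = on;  -- needs a restart
    ALTER SYSTEM SET archive_command = 'cp %p /backup/wal/%f';
    -- %p is the path of the finished WAL segment, %f its file name.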

Watch out with encoding: the encoding is fixed at database creation time. So just create the database with UTF-8.
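
For example (database name invented):

    CREATE DATABASE myproject
      WITH ENCODING 'UTF8'
           TEMPLATE template0;  -- template0 allows an encoding different from the cluster default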

Schema migration can take a lot of time, especially when a column with a default value is added: every row needs updating. Try to allow NULL in your columns if at all possible, so don’t set NOT NULL unless absolutely needed. (Update: I wrote it down backwards, I originally said you should stick to NOT NULL. See one of the comments below.) Enforce mandatory fields in your Django code instead.
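
The difference in SQL terms, on the postgres versions current at the time (the table is hypothetical; since postgres 11 the default-value case has become cheap as well):

    -- Instantaneous: only the catalog changes, existing rows stay untouched.
    ALTER TABLE books ADD COLUMN subtitle text;

    -- Rewrites every row to fill in the default:
    ALTER TABLE books ADD COLUMN subtitle text NOT NULL DEFAULT '';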

For hardware, use your own physical hardware. Especially good IO matters. RAID 10 for the main DB, RAID 1 for the logs. Don’t use a SAN unless it is really good.

And… watch out with Django’s transactions. You might want to manage your own transactions instead of using Django’s automatic ones. See his weblog at http://thebuild.com; he writes a lot about Django transactions there.

(In reaction to a question: postgis for geographical data is great.)
