Djangocon: on the look-out for your data - Markus Holtermann

Tags: djangocon, django

(One of my summaries of a talk at the 2018 european djangocon.)

We have piles of data in our databases. We want to search in there.

What is “search”, actually? “Try to find something by looking or otherwise seeking carefully and thoroughly”. On the one hand, there’s “try”. On the other hand “careful and thorough”. So: search is hard.

What about searching in django? In a very basic sense, you already search if you do get_object_or_404(Article, pk=12)! And if you don’t find it, you return a “404 Not found”.

Searching text is more complicated. Text is unstructured data. So you’d have something like get_object_or_404(Article, text__icontains=search_parameter). This is not efficient for your database: the query boils down to something like WHERE text ILIKE '%your search parameter%', and a pattern that starts with a wildcard cannot use a regular (B-tree) index.
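A small sketch of that substring search, using the sqlite3 module from the standard library as a stand-in for postgres (an assumption made only so the example runs without a server; the table, index name and rows are made up). It shows the shape of the query and that the database reports a scan instead of an index lookup:

```python
import sqlite3

# In-memory stand-in for the real database (assumption: SQLite instead of
# postgres, so this runs anywhere). Table and index names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, text TEXT)")
conn.execute("CREATE INDEX article_text_idx ON article (text)")
conn.executemany(
    "INSERT INTO article (text) VALUES (?)",
    [("Django is the best",), ("Searching with postgres",), ("Model railways",)],
)

search_parameter = "django"
pattern = f"%{search_parameter}%"  # wildcard on both sides: substring search
sql = "SELECT id FROM article WHERE text LIKE ?"

# SQLite's LIKE is case-insensitive for ASCII, comparable to ILIKE here.
matches = [row[0] for row in conn.execute(sql, (pattern,))]

# Ask the database how it would run the query. The last column of each
# plan row is a human-readable description of the step.
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, (pattern,))]

print(matches)
print(plan)
```

The plan line starts with SCAN, so every row is visited even though an index on the column exists. The exact wording of the plan differs between SQLite versions, and postgres has its own planner, so treat this as an illustration only.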

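There is an index in postgres that’s much more efficient: trigrams (from the pg_trgm extension). These chop up your text into pieces of three consecutive characters. Words are padded with spaces, so the pieces at the edges of a word contain only one or two of its letters. See https://www.postgresql.org/docs/current/static/pgtrgm.html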
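To make the trigram idea concrete, here is a rough pure-Python sketch. It is not postgres' implementation: it only handles ASCII letters and digits, and the function names are my own. It pads every word with two spaces in front and one behind before cutting it into three-character pieces, and compares two strings by the share of pieces they have in common:

```python
import re


def trigrams(text):
    """Return the set of three-character pieces for all words in text."""
    result = set()
    # Simplification: only ASCII letters and digits count as word characters.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two spaces in front, one behind
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(sorted(trigrams("cat")))
print(similarity("word", "two words"))
```

For "cat" the pieces are "  c", " ca", "cat" and "at ". "word" and "two words" share 4 of their 11 distinct pieces, so the similarity is 4/11. An index over such pieces lets the database narrow down candidate rows without reading every text.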

It is not included in django, but you can write a custom index for it. The code fitted, readably, on one slide.

When we talk about searching text, we often mean full text search. Word order doesn’t matter, for instance. And the exact verb form also doesn’t matter. In linguistics there is the term “stemming”. Computer, compute and computation are all “stemmed” to just “comput”.

Often you also ignore stopwords. “Django is the best” becomes “django best”.
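A toy illustration of stemming and stopwords. The suffix list and the stopword list are invented for this example and are far cruder than the real stemmers a search engine uses. It is only meant to show the normalization idea, including that word order stops mattering:

```python
import re

# Hypothetical, tiny lists just for this example.
STOPWORDS = {"is", "the", "a", "an", "and", "of"}
SUFFIXES = ("ation", "er", "e")  # checked in this order, longest first


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, drop stopwords, stem, and forget the word order."""
    words = re.findall(r"[a-z]+", text.lower())
    return {stem(word) for word in words if word not in STOPWORDS}


print(normalize("Django is the best"))
print(normalize("Computer, compute, computation"))
```

Both a document and a search query would go through the same normalization, so "the best Django" and "Django is the best" end up as the same set of terms.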

You can get quite far with postgres’ built-in search. It is integrated in django, see the documentation.

If you want to go further, you can look at xapian, solr, whoosh, lucene, elasticsearch. These are effectively a second database that you use just for textual search. As it is a second database, you need to keep the two in sync. You could use transaction.on_commit() for this. Perhaps in your .save() method or in celery.

Same with deletion. But, as with the .save() method, watch out that there are some cases where those methods aren’t called. A queryset’s .delete() doesn’t call the individual objects’ .delete() method, for instance.
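The reason to hook into the commit is that the search index should only change once the database change is really final. Below is a toy model of that run-after-commit idea in plain Python. It is not django's implementation: ToyTransaction, save_article and delete_article are hypothetical names, and both "databases" are simple dicts.

```python
class ToyTransaction:
    """Runs queued callbacks only when the with-block ends without error."""

    def __init__(self, database):
        self.database = database
        self._snapshot = None
        self._callbacks = []

    def __enter__(self):
        self._snapshot = dict(self.database)  # remember state for rollback
        return self

    def on_commit(self, func):
        self._callbacks.append(func)

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            for func in self._callbacks:  # commit: now sync the index
                func()
        else:
            self.database.clear()  # rollback: restore the old state
            self.database.update(self._snapshot)
        self._callbacks = []
        return False  # never swallow the exception


database = {}      # stand-in for the regular database
search_index = {}  # stand-in for the search engine


def save_article(tx, pk, text):
    database[pk] = text
    tx.on_commit(lambda: search_index.__setitem__(pk, text.lower()))


def delete_article(tx, pk):
    database.pop(pk, None)
    tx.on_commit(lambda: search_index.pop(pk, None))


with ToyTransaction(database) as tx:
    save_article(tx, 1, "Django is the best")

try:
    with ToyTransaction(database) as tx:
        save_article(tx, 2, "Never committed")
        delete_article(tx, 1)
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

print(database)
print(search_index)
```

After the failed second block, article 1 is still in both stores and article 2 is in neither. Note that the sketch has the same weak spot as the real thing: any code path that changes the database without going through save_article or delete_article leaves the index out of date.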

You could keep more info in your search engine, if you use elasticsearch for instance. Like complete article summaries, urls and so on. Even if the search index isn’t completely up to date, the user will still get a result. Better than a 404.

You could even run most of your website out of your search engine, so that your website continues to work even when your regular database is temporarily switched off.


 

Reinout van Rees

My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.
