(One of my summaries of a talk at the 2018 European djangocon.)
We have piles of data in our databases. We want to search in there.
What is “search”, actually? “Try to find something by looking or otherwise seeking carefully and thoroughly”. On the one hand, there’s “try”. On the other hand “careful and thorough”. So: search is hard.
What about searching in django? In a very basic sense, you already search if you do get_object_or_404(Article, pk=12)! And if you don’t find it, you return a “404 Not found”.
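Searching text is more complicated. Text is unstructured data. So you’d have something like get_object_or_404(Article, text__icontains=search_parameter). This is not efficient for your database, as it turns into WHERE text ILIKE '%your search parameter%'. With a wildcard at the start of the pattern, a regular index doesn’t help, so the database has to scan every row.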
There is an index in postgres that’s much more efficient: trigrams. These chop up your text into overlapping pieces of three characters. Word boundaries are padded with spaces, so the pieces at the start and end of a word hold only one or two of its letters. See https://www.postgresql.org/docs/current/static/pgtrgm.html
It is not included in django, but you can write a custom index for it. The code fitted (readable!) on one slide.
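To get a feel for what a trigram index works with, here is a minimal pure-Python sketch of how text gets chopped up and how two strings can be compared on their shared trigrams. It follows the behaviour described in the pg_trgm documentation (lowercasing, two spaces of padding in front of each word, one behind), but it is not the extension itself: the function names trigrams and similarity are my own, and the sketch only handles ASCII letters and digits.

```python
import re


def trigrams(text):
    """Return the set of trigrams in text, roughly the way pg_trgm builds them."""
    found = set()
    # Lowercase first; anything that is not a letter or digit separates words.
    # (Assumption: ASCII only, to keep the sketch short.)
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Each word gets two spaces in front and one behind before cutting.
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            found.add(padded[i:i + 3])
    return found


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

A typo like "djagno" still shares its first few trigrams with "django", which is why this kind of index copes with partial matches and misspellings where a plain ILIKE cannot.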
When we talk about searching text, we often mean full text search. Word order doesn’t matter, for instance. And the exact verb form also doesn’t matter. In linguistics there is the term “stemming”. Computer, compute and computation are all “stemmed” to just “comput”.
Often you also ignore stopwords. “Django is the best” becomes “django best”.
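The two ideas above (stemming and dropping stopwords) can be sketched in a few lines of plain Python. This is a deliberately naive illustration, not what postgres or any real search engine does: the stopword list and the suffix list are tiny ones I made up for the examples in this summary, and real stemmers use far more careful rules per language.

```python
import re

# Tiny illustrative lists (assumptions); real engines ship much larger ones.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}
SUFFIXES = ("ation", "ing", "er", "ed", "es", "e", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, split into words, drop stopwords and stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(word) for word in words if word not in STOPWORDS]
```

With this, normalize("Django is the best") gives ["django", "best"], and "Computer, compute, computation" turns into three times "comput". Comparing the results as sets is what makes word order irrelevant.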
You can get quite far with postgres’ built-in search. It is integrated in django, see the documentation.
If you want to go further, you can look at xapian, solr, whoosh, lucene or elasticsearch. Effectively a second database that you use just for textual search. As it is a second database, you need to keep the two in sync. You could use transaction.on_commit() for this. Perhaps in your .save() method or in celery.
Same with deletion. But, as with the .save() method, watch out that there are some cases where those methods aren’t called. A queryset’s .delete() doesn’t call the individual objects’ .delete() method, for instance.
You could keep more info in your search engine, if you use elasticsearch for instance. Like complete article summaries, URLs and so on. Even if the search index isn’t completely up to date, the user will still get a result. Better than a 404.
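As a rough illustration of a second “database” that stores enough to show results on its own, here is a small in-memory stand-in written in plain Python. It is only a sketch: TinySearchIndex and its methods are hypothetical names of mine and have nothing to do with the elasticsearch API. The point is that title, URL and summary live in the index itself, so a result page can be rendered without touching the main database.

```python
from collections import defaultdict


class TinySearchIndex:
    """In-memory stand-in for a separate search engine (hypothetical)."""

    def __init__(self):
        self.documents = {}                # pk -> stored title, url, summary
        self.postings = defaultdict(set)   # word -> pks of matching documents

    def index(self, pk, title, url, summary):
        """Add or refresh a document, for instance after a commit."""
        self.remove(pk)
        self.documents[pk] = {"title": title, "url": url, "summary": summary}
        for word in (title + " " + summary).lower().split():
            self.postings[word].add(pk)

    def remove(self, pk):
        """Forget a document, for instance when the object is deleted."""
        if self.documents.pop(pk, None) is None:
            return
        for pks in self.postings.values():
            pks.discard(pk)

    def search(self, query):
        """Return the stored fields of documents containing every query word."""
        words = query.lower().split()
        if not words:
            return []
        matches = set.intersection(*(self.postings.get(w, set()) for w in words))
        return [self.documents[pk] for pk in sorted(matches)]
```

In a django project, index() and remove() are the calls you would have to trigger on every save and delete, including the bulk operations that skip the model methods.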
You could even run most of your website out of your search engine, so that your website continues to work even when your regular database is temporarily switched off.
Photo explanation: constructing a viaduct module (which spans a 2m staircase) for my model railway in my attic.
My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.