Djangocon: from icontains to search - Honza Kral

(One of the summaries of a talk at the 2014 djangocon.eu.)

Honza Kral talks about searching, which is a big topic. He’ll focus on unstructured search: fulltext search. Search is an interface to your data.

Textual search:

$ grep -ri django *

or:

>>> from django.db.models import Q
>>> Documents.objects.filter(
...     Q(title__icontains='django') |
...     Q(body__icontains='django'))

Don’t do this. Neither approach is going to scale, as both have to scan all the text for every query. And you’re going to miss things. If you search for “running” you’ll miss “runs”, for instance.

The first index in world history was the bible concordance of 1230: just a simple list of biblical words with pointers to the passages in which each word occurs.

We can do the same. Technically it is an inverted index. We can do more than just look up individual words: we can also look for phrases. In that case you look not only at the documents the words occur in, but also at their position within those documents.
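
To make the inverted index a bit more concrete, here is a minimal sketch in plain Python. It is not how Elasticsearch stores things internally; the function names (`build_index`, `search_phrase`) and the sample documents are made up for illustration. Every word points to the documents it occurs in plus the positions within each document, which is what makes phrase search possible.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def search_phrase(index, phrase):
    """Return the ids of documents that contain the phrase's words consecutively."""
    words = phrase.lower().split()
    if not words:
        return set()
    matches = set()
    # Start from every occurrence of the first word ...
    for doc_id, positions in index.get(words[0], {}).items():
        for start in positions:
            # ... and check that each following word sits at the next position.
            if all(
                start + offset in index.get(word, {}).get(doc_id, [])
                for offset, word in enumerate(words[1:], start=1)
            ):
                matches.add(doc_id)
                break
    return matches


# Hypothetical sample documents.
docs = {
    1: "Django is a web framework",
    2: "A framework for the web written in Python",
    3: "The web framework for perfectionists",
}
index = build_index(docs)
print(sorted(index["framework"]))             # documents containing the word
print(search_phrase(index, "web framework"))  # documents containing the phrase
```

Document 2 contains both words, but not next to each other, so it matches the single-word lookup and not the phrase.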

We can do more things when storing words in our index:

Leave out common words like ‘a’ and ‘the’.
Lowercase everything.
Normalize words (reduce the various forms of a verb to just one, treat singular and plural as the same word).
Look for synonyms. (Fast is almost the same as rapid, for instance).
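
The steps in this list can be sketched as a small analysis function. This is only an illustration under loud assumptions: the stop word list, the synonym table and especially the suffix-stripping `stem()` are deliberately naive stand-ins that I made up, nothing like the real analyzers a search engine ships with.

```python
# All three tables are tiny, hypothetical examples.
STOP_WORDS = {"a", "an", "the", "is", "of"}
SYNONYMS = {"rapid": "fast", "quick": "fast"}
SUFFIXES = ("ning", "ing", "es", "s")  # naive: checked in this order


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, drop stop words, normalize and map synonyms."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,:;!?\"'()")
        if not word or word in STOP_WORDS:
            continue
        word = stem(word)
        tokens.append(SYNONYMS.get(word, word))
    return tokens


print(analyze("The rapid dog runs"))
print(analyze("Running!"))
```

Both “runs” and “running” end up as the same token, so a search for one finds the other, as long as the same function is applied to the search string.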

All this happens at index time, so the computationally expensive part only happens once. There’s no free lunch, though: if you change your rules, you typically need to re-index everything.

Of course the same steps are also done with the search string.

We can do relevancy calculations! We know exactly how often a word occurs, so we know if it is common or rare. If you find a match with a rare word, that’s probably a good result. The calculation used is the “Lucene similarity” formula. You basically calculate positive and negative factors and turn that into a score per result.
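
To give a feel for such a score, here is a simplified TF-IDF sketch. Note that this is not the actual Lucene similarity formula, which has more factors; it only shows the basic idea that a match on a rare term counts for more than a match on a common one. The function name and the sample documents are made up.

```python
import math
from collections import Counter


def score(query_terms, documents):
    """Return {doc_id: score} using a plain TF-IDF sum (not Lucene's formula)."""
    n_docs = len(documents)
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        total = 0.0
        for term in query_terms:
            if counts[term]:
                tf = counts[term] / len(tokens)                 # frequent in this document: good
                idf = math.log(n_docs / doc_freq[term]) + 1.0   # rare overall: good
                total += tf * idf
        if total:
            scores[doc_id] = total
    return scores


# Hypothetical sample documents.
docs = {
    1: "django web framework",
    2: "python web server",
    3: "ruby web framework",
}
scores = score(["django", "web"], docs)
print(sorted(scores, key=scores.get, reverse=True))
```

All three documents match on “web”, but only document 1 matches on the rare term “django”, so it ends up on top.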

When showing the results, we don’t want to show the lowercased normalized word, but the actual one. So per item, we store the original term, the normalized version and the location.
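
A small sketch of why you want the original term and the location next to the normalized version: it lets you highlight the text as the author wrote it. The normalization here is just lowercasing and both function names are hypothetical.

```python
import re


def tokenize_with_offsets(text):
    """Yield (original, normalized, start, end) for every word in the text."""
    for match in re.finditer(r"\w+", text):
        original = match.group()
        yield original, original.lower(), match.start(), match.end()


def highlight(text, query):
    """Wrap every word matching the query in asterisks, keeping the original spelling."""
    wanted = query.lower()
    parts, cursor = [], 0
    for original, normalized, start, end in tokenize_with_offsets(text):
        if normalized == wanted:
            parts.append(text[cursor:start])
            parts.append("*" + original + "*")
            cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


print(highlight("Django rocks, django rules", "DJANGO"))
```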

Something else: facets and filtering. We can group per facet and filter on it. See github: search for something and you’ll also get a list of languages (“15000 ‘python’ results for your search ‘django’”). Facets and filtering really help when people are searching for something. It prevents them from floundering.

Per facet you define a bucket (and possibly sub-buckets) and place documents in them based on some criterion. Every bucket can then have statistics (“count”, “average”).

Filtering happens per bucket. It doesn’t need analysis: something is either in a bucket or not. It is very fast and very cacheable. It is perfect for building a faceted navigation.
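
In plain Python, buckets with statistics and a filter boil down to something like the sketch below. The repository data and the function names are invented for the example; a real search engine does this on its index instead of on a list of dicts.

```python
from collections import Counter

# Hypothetical search results.
repos = [
    {"name": "django", "language": "python", "stars": 900},
    {"name": "flask", "language": "python", "stars": 700},
    {"name": "rails", "language": "ruby", "stars": 800},
]


def facet_counts(items, field):
    """The 'count' statistic: number of items per bucket."""
    return Counter(item[field] for item in items)


def filter_bucket(items, field, value):
    """Filtering is a yes/no check: an item is in the bucket or it isn't."""
    return [item for item in items if item[field] == value]


def bucket_average(items, field, value, metric):
    """The 'average' statistic of one bucket, or None for an empty bucket."""
    bucket = filter_bucket(items, field, value)
    if not bucket:
        return None
    return sum(item[metric] for item in bucket) / len(bucket)


print(facet_counts(repos, "language"))
print([repo["name"] for repo in filter_bucket(repos, "language", "python")])
print(bucket_average(repos, "language", "python", "stars"))
```

No scoring is involved in the filter, which is why its result can be cached and reused for every query.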

You can also do phrase suggestions. If you search for “djangi” you could suggest “django”. You use a calculation to determine if there are other oft-occurring terms that have a lot more “hits”.
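
A rough sketch of such a suggestion, using `difflib` from the standard library for the “looks similar” part. The hit counts, the similarity cutoff and the factor of ten are assumptions of mine, not something from the talk, and a real suggester works on the index itself.

```python
import difflib


def suggest(term, term_counts, min_ratio=0.75, factor=10):
    """Suggest a similar term that has at least `factor` times more hits."""
    own_hits = term_counts.get(term, 0)
    candidates = difflib.get_close_matches(
        term, list(term_counts), n=5, cutoff=min_ratio
    )
    better = [
        candidate
        for candidate in candidates
        if candidate != term and term_counts[candidate] >= factor * max(own_hits, 1)
    ]
    return max(better, key=term_counts.get) if better else None


# Hypothetical number of hits per indexed term.
term_counts = {"django": 15000, "djangi": 2, "flask": 4000, "jinja": 900}

print(suggest("djangi", term_counts))
print(suggest("django", term_counts))
```

“djangi” gets “django” as a suggestion, while “django” itself gets none, as its look-alike has far fewer hits.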

In Python you can search using Elasticsearch. It has a nice low-level Python API.

There’s a bulk() method that can bulk-upload models into the search index, and also one for uploading a single instance.
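
As a sketch of what bulk uploading could look like: the generator below turns documents into action dicts. The shape of the dicts is my assumption of what the `elasticsearch.helpers.bulk()` helper accepts (the exact required keys depend on the Elasticsearch version), and the index name and fields are made up. The sketch itself doesn't import the client or talk to a server.

```python
def to_actions(documents, index_name="documents"):
    """Turn plain dicts (for instance built from model instances) into bulk actions."""
    for doc in documents:
        yield {
            "_index": index_name,  # hypothetical index name
            "_id": doc["id"],
            "_source": {"title": doc["title"], "body": doc["body"]},
        }


# Hypothetical documents.
documents = [
    {"id": 1, "title": "Django", "body": "The web framework for perfectionists"},
    {"id": 2, "title": "Search", "body": "From icontains to search"},
]
actions = list(to_actions(documents))
print(actions[0])

# With the elasticsearch package installed and a cluster running, the
# actions would presumably be sent with something like:
#
#     from elasticsearch import Elasticsearch, helpers
#     helpers.bulk(Elasticsearch(), to_actions(documents))
```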

Some extra tips:

Many people use search for auto-completion. This is a bad match, because speed is an issue and relevance makes no sense. It is better to use a specific completion suggester. The score and the text are supplied by the site. You can for instance provide a list of the most popular GitHub repositories and feed that to the autocompleter.
Don’t sort! This destroys the relevancy. And you have to read everything.

When you sort, you normally want to influence the score. You can do this the neat way by supplying a script or by taking a different field into account or by boosting specific documents. This is much better than plain sorting.
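
As an illustration of taking a different field into account, this sketch builds a query body that uses Elasticsearch’s `function_score` query with a `field_value_factor`. The field names (`title`, `stars`) are hypothetical and which options are available depends on your Elasticsearch version, so treat it as a starting point and not as the query from the talk.

```python
def boosted_query(text, boost_field="stars"):
    """Build a query body that multiplies the text score by a numeric field."""
    return {
        "query": {
            "function_score": {
                # The normal relevancy query ...
                "query": {"match": {"title": text}},
                # ... whose score is influenced by a numeric field. log1p
                # dampens the effect so huge values don't drown out relevancy.
                "field_value_factor": {
                    "field": boost_field,
                    "modifier": "log1p",
                },
                "boost_mode": "multiply",
            }
        }
    }


body = boosted_query("django")
print(body)
```

A popular document moves up this way, but a document that doesn’t match the text still doesn’t show up, and relevancy keeps playing its part in the ordering.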

http://reinout.vanrees.org/images/2014/IRM_and_ic_plus_in_Rotterdam.jpg

Dutch intercity trains in Rotterdam central station

Reinout van Rees

My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.
