(One of my summaries of the Pycon NL one-day conference in Utrecht, NL).
(Sofie helps maintain FastAPI, Typer and spaCy; this talk is all about AI).
Sofie started with an example of a chatbot getting confused about who actually won an F1 race after the original winner was disqualified. So you need a domain expert on board who can double-check the data and the results.
Let’s say you want your chatbot output to link to Wikipedia for important terms. That’s actually a hard task, as it requires normalizing terms and disambiguating between Hamilton-the-driver, Hamilton-the-town, Hamilton-the-founding-father and more.
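To get a feel for why this is hard, here is a toy sketch of mention disambiguation with a hand-made candidate table and a few context cue words (everything here is made up for illustration; a real entity linker, like spaCy’s EntityLinker component, works against a proper knowledge base):

```python
# Toy candidate table: mention -> possible Wikipedia pages, each with cue words.
CANDIDATES = {
    "Hamilton": {
        "Lewis_Hamilton": {"race", "f1", "mercedes", "driver"},
        "Alexander_Hamilton": {"founding", "treasury", "musical"},
        "Hamilton,_Ontario": {"ontario", "city", "canada"},
    }
}

def link(mention: str, context: str) -> str:
    """Pick the candidate page whose cue words overlap most with the context."""
    candidates = CANDIDATES.get(mention, {})
    words = set(context.lower().split())
    return max(candidates, key=lambda page: len(candidates[page] & words), default=mention)

print(link("Hamilton", "Hamilton won the race for Mercedes"))         # Lewis_Hamilton
print(link("Hamilton", "Hamilton was the first treasury secretary"))  # Alexander_Hamilton
```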
There’s a measure for output quality called the “F-score”, which combines precision and recall into a single number. She used an AI model to find the correct page and got a 79.2% F-score. How good or bad is that?
For this, you can try to determine a reasonable baseline. “Guessing already means 50%” is what you might think. No: there are about 7 million Wikipedia pages, so random guessing gives an F-score of essentially 0%. A smarter baseline: only consider the pages that actually mention the word “Hamilton”, and narrow down further with longer mentions like “Alexander Hamilton” or “Lewis Hamilton”. Such a basic, non-AI rule-based approach should already get at least 78%, so the AI model’s 79.2% isn’t that impressive.
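A minimal sketch of how such a baseline-vs-model comparison could look (the gold labels and predictions are made up; scikit-learn is assumed to be available):

```python
from sklearn.metrics import f1_score

# Hypothetical gold Wikipedia pages for a handful of "Hamilton" mentions.
gold = ["Lewis_Hamilton", "Lewis_Hamilton", "Alexander_Hamilton", "Hamilton,_Ontario"]

# A non-AI rule-based baseline and the AI model's predictions (both invented).
baseline = ["Lewis_Hamilton", "Alexander_Hamilton", "Alexander_Hamilton", "Hamilton,_Ontario"]
model = ["Lewis_Hamilton", "Lewis_Hamilton", "Alexander_Hamilton", "Lewis_Hamilton"]

# If the baseline already scores close to the model, the model isn't impressive.
print("baseline F-score:", f1_score(gold, baseline, average="micro"))
print("model F-score:   ", f1_score(gold, model, average="micro"))
```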
The highest reachable quality depends on the data itself and on what people expect. In “Hamilton won at Spa”, do you expect “Spa” to point at the town or at the circuit? The room voted 60/40, so even the best possible answer can’t be 100% correct :-)
A tip: if you get a bad result, investigate the training data to see if you can spot some structural problem (which you can then fix). Especially if you have your own annotated data. In her example, some annotators included the “GP” or “grand prix” part when annotating circuit names (“Monaco GP”), while others annotated just the town name (“Spa”).
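A rough sketch of hunting for that kind of inconsistency, assuming the annotations are simple (text, label) pairs (the data and the normalization rule are made up for illustration):

```python
from collections import defaultdict

# Hypothetical annotated spans: (surface text, label).
annotations = [
    ("Monaco GP", "CIRCUIT"),
    ("Monaco", "CIRCUIT"),
    ("Spa", "CIRCUIT"),
    ("Spa-Francorchamps", "CIRCUIT"),
    ("Lewis Hamilton", "DRIVER"),
]

# Group circuit spans by a crude normalised key (first word, lowercased) to
# surface cases where the same circuit was annotated with different spans.
variants_per_circuit = defaultdict(set)
for text, label in annotations:
    if label == "CIRCUIT":
        key = text.lower().replace("-", " ").split()[0]
        variants_per_circuit[key].add(text)

for key, variants in variants_per_circuit.items():
    if len(variants) > 1:
        print(f"Inconsistent spans for {key!r}: {sorted(variants)}")
```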
Some more tips:
Ensure your label scheme is consistent.
Draft clear annotation guidelines.
Measure inter-annotator agreement (IAA): how often do your annotators agree with each other? For an article about both F1 and politics: how many annotate it as politics and how many as F1? (See the sketch after this list.)
Consider reframing your task/guidelines if the IAA is low.
Model uncertainty in your annotation workflow.
Identify structural data errors.
Apply your model to truly unseen data to measure its real performance.
Make sure you climb the right hill.
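For the IAA tip above, a quick sketch with Cohen’s kappa (the two annotators’ labels are made up; scikit-learn is assumed):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators for the same ten articles.
annotator_a = ["F1", "F1", "politics", "F1", "politics", "F1", "F1", "politics", "F1", "F1"]
annotator_b = ["F1", "politics", "politics", "F1", "F1", "F1", "F1", "politics", "F1", "politics"]

# Cohen's kappa corrects for chance agreement: 1.0 is perfect agreement,
# 0.0 is no better than chance. A low value is a hint that the task or the
# guidelines may need reframing.
print("kappa:", cohen_kappa_score(annotator_a, annotator_b))
```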
Unrelated photo from our 2025 holiday in Austria: just over the border in Germany, we stayed two days in Passau. View from the ‘Oberhaus’ castle of the three rivers merging, with visibly different colors. From the left, the small, dark ‘Ilz’. The big, drab-colored one in the middle is the ‘Donau’ (so the ‘schöne blaue Donau’, the ‘beautiful blue Danube’, should be taken with a grain of salt). From the right, also big, the much lighter ‘Inn’ (lots of granite sediment from the Alps, here).