The first “pyutrecht” meetup in Amersfoort in the Netherlands. (Amersfoort is not the city of Utrecht, but it is in the similarly named province of Utrecht).
I gave a talk myself about being more of a proper programmer to your own
laptop setup. Have a git repo with a
README explaining which programs you
installed. An install script or makefile for installing certain
tools. “Dotfiles” for storing your config in git. Etc. I haven’t made a
summary of my own talk. Here are the other three:
William works at deliverect, the host of the meeting. Webscraping means extracting data from a website and parsing it into a more useful format. Like translating a list of restaurants on a
There’s a difference with web crawling: that is following links and trying to download all the pages on a website.
robots.txt. As a crawler or scraper you’re supposed to read it
as it tells you which user agents are allowed and which areas of the website
are off-limits (or not useful).
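A minimal sketch of how a scraper could honour robots.txt, using Python's standard `urllib.robotparser`. The robots.txt content, the user agent names and the example.com URLs are made up for illustration; a real scraper would download the file from the site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search

User-agent: badbot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A generic scraper may read the restaurant list, but not the admin area.
may_read_list = parser.can_fetch("my-scraper", "https://example.com/restaurants/")
may_read_admin = parser.can_fetch("my-scraper", "https://example.com/admin/users")
# The user agent "badbot" is not welcome anywhere.
badbot_allowed = parser.can_fetch("badbot", "https://example.com/restaurants/")

print(may_read_list, may_read_admin, badbot_allowed)
```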
Another useful file that is often available:
/sitemap.xml. A list of URLs
in the site that the site thinks are useful for scraping or crawling.
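Reading such a sitemap needs nothing more than the standard library. A small sketch, assuming the sitemap follows the common sitemaps.org format; the XML content and the URLs in it are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content (an assumption for illustration).
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url>
    <loc>https://example.com/restaurants/</loc>
    <lastmod>2023-11-01</lastmod>
  </url>
</urlset>
"""

# The sitemaps.org namespace has to be spelled out when searching.
NAMESPACE = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> list[str]:
    """Return all <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", NAMESPACE)
        if loc.text
    ]


urls = sitemap_urls(SITEMAP_XML)
print(urls)
```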
A tip: beautifulsoup, a python library for extracting neat, structured content from an otherwise messy html page.
selenium is an alternative as it behaves much more like a regular web browser. So you can “click” a “next” button a couple of times in order to get a full list of items. Because selenium behaves like a real web browser, things like cookies and IDs in query parameters and headers just work. That makes it easier to work around many kinds of basic protection.
A microcontroller is a combination of cpu, memory and some interfaces to external ports. https://micropython.org is a version of python for such low-power devices.
He demoed python’s prompt running on a raspberrypi micro connected via microUSB. And of course the mandatory lets-blink-the-onboard-LED programs. And then some other demos with more leds and servos. Nice.
A big advantage of micropython is that it doesn’t care what processor you have. With C/C++ you specifically have to compile for the right kind of processor. With micropython you can just run your code anywhere.
You can use micropython in three ways:
As .py sources, uploaded to the microcontroller.
.mpy code, also uploaded.
.mpy included in the images
He showed a couple of possible target microcontrollers. A note to myself about
ESP8266: limited support, use
.mpy. I think I have a few of those
at home for should-test-it-at-some-time :-) Some examples: Pi RP2040, ESP32,
A problem: RAM is scarce in such chips and python is hungry… You can do some tricks like on-demand loading. Watch out when using an LCD graphic display, that takes 150kb easily.
You have to watch out for the timing requirements of what you want to do. Steering a servo is fine, but “neopixel” leds for instance need a higher frequency of signals than micropython is capable of on such a microcontroller. If you use a C library for it, it works (he showed a demo).
Erik works as maintainer on the Graphene and the strawberry-GraphQL projects.
Graphql is a query language for APIs. It is an alternative to the well-known REST method. With REST you often have to do multiple requests to get all the data you need. And the answers will often give more information than you actually need.
With graphql, you always start with a graphql schema. You can compare it a bit to an openapi document. The graphql schema specifies what you can request (“a Meetup has a name, description, list of talks, etc”).
An actual query specifies what you want to get back as response. You can omit fields from the schema that you don’t need. If you don’t need “description”, you leave it out. If you want to dive deeper into certain objects, you specify their fields.
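To make the "you only get the fields you ask for" idea concrete, here is a rough sketch in plain Python. It is not a graphql implementation and not strawberry's API: the `select` helper, the nested-dict "query" notation and the meetup data are all made up for illustration.

```python
# Hypothetical data, shaped like the "Meetup" example (all values invented).
MEETUP = {
    "name": "example meetup",
    "description": "A long description that the client may not need.",
    "talks": [
        {"title": "Talk A", "minutes": 30},
        {"title": "Talk B", "minutes": 20},
    ],
}


def select(data, selection):
    """Return only the fields named in ``selection``.

    ``selection`` maps a field name to ``True`` (a plain field) or to a
    nested selection dict (for an object or a list of objects).
    """
    if isinstance(data, list):
        return [select(item, selection) for item in data]
    result = {}
    for field, sub_selection in selection.items():
        value = data[field]
        result[field] = value if sub_selection is True else select(value, sub_selection)
    return result


# Roughly what a graphql query like ``{ meetup { name talks { title } } }``
# asks for: the name, and for every talk only its title.
query = {"name": True, "talks": {"title": True}}
response = select(MEETUP, query)
print(response)
```

The description and the talk durations are left out of the response because the query did not ask for them.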
Strawberry is a graphql framework. It has
integrations for django, sqlalchemy, pydantic and more. The schemas are
defined with classes annotated with
@strawberry.type and fields with
python type hints. (It looked neat!)
He showed a live demo, including the browser-based query interface bundled with graphql.
Note: strawberry is the more modern project (type hints and so on) and will later have all the functionality of graphene. So if strawberry’s functionality is enough, you should use that one.
“Everybody” uses stackoverflow. Now lots of people use chatgpt (or chatgpt plus). Stackoverflow traffic has dropped by 50% in the last 1.5 years. So chatgpt can be your coding buddy.
He really likes it for quickly getting something working (MVP). Like writing something that talks to a magento API (a webshop system). It would take him ages to figure it all out. Or he could ask chatgpt.
He also thinks you don’t need docstrings anymore: you can just ask chatgpt to explain a snippet of code for you. (Something I myself don’t agree with, btw).
(He demoed some chatgpt code generation of a sample website). What he learned:
Good briefing and interaction is key. First tell it what you want before you start to code.
Chatgpt sometimes loses track if the interaction goes on for too long.
Read what it gives you, otherwise you won’t know what it built for you.
Watch out for the “cut-off time” of the chatgpt training set: perhaps newer versions of libraries don’t work anymore with the generated code.
You get lazy.
You can get frustrated if you don’t understand what has been generated for you.
The world of education was a bit shocked by chatgpt. The institution they work for advises to be a bit careful, but allows it. But you’re not allowed to let chatgpt write parts of your official thesis, just like you’re not allowed to let a family member write it. Chatgpt usage can be treated as fraud.
Well, which tools can be used to search for possible fraud?
GPT-2 output detector
AI detector pro
Chatgpt (yes, you can ask it whether it looks like it wrote something).
They looked at 40 student reports from a variety of fields, in both Dutch and English. All were written between January 2020 and June 2022, so before chatgpt could have been used. For every report, they made three summaries:
One by a human.
One by chatgpt.
Chatgpt, but altered by QuillBot, which should make it look less recognizable.
So: 120 test samples in total. In the end copyleaks performed the best. The others didn’t do well.
LLM models can be huge. Mind-bogglingly huge. But… we can also have fun with small models.
He works at a company that regulates climate installations in buildings (HVAC, heating, ventilation, air conditioning) via the cloud. Buildings use 30% of all energy worldwide. So improving how the HVAC installation is used has a big impact.
A use case: normally you pre-heat rooms so that it is comfy when you arrive. But sometimes the sun quickly warms the room anyway shortly afterwards. Can you not conserve some energy without sacrificing too much comfort?
You could calculate an optimal solution, but the approach here is to “just” measure every individual room and combine that with an AI.
An “edge device” inside the building.
An external API.
The API stores the data in mysql (the room metadata) and influxdb (the timeseries).
A user selects a room and a machine learning model type and a training data set (from historical data).
The software creates a dataset from influxdb, trains the model (pytorch). The trained neural network goes to ONNX (open neural network exchange). The output is stored in minio (S3-compatible object store). Note: all this is internal: no chatgpt or so.
With the business logic these predictions get interpreted and used for steering the heating. Normally you can achieve 3-5% savings.
The actual steering happens locally in the building with a “go” program that reads the ONNX data. It is open source and is called… gonnx :-)
They have a server with 1 GPU, which is enough for training all those models!
Alternative title: “the dark side of integrating an LLM (large language model) in your software”. You run into several challenges. He illustrates it with https://www.learntail.com/ , something he helped build. It creates quizzes from text to make the reader more active.
What he used was the python library langchain to connect his app with a LLM. A handy trick: you can have it send extra format instructions to chatgpt based on a pydantic model. If it works, it works. But if you don’t get proper json back, it crashes.
Some more challenges:
There is a limit on prompt length. If it gets too long, the LLM won’t fully understand it anymore and ignores some of the instructions.
A LLM is no human being. So “hard” or “easy” don’t mean anything. You have to be more machine-explicit, like “quiz without jargon”.
The longest answer it provides is often the correct one. Because the data it has been trained on often has the longest one as the correct answer…
Limits are hard to predict. The token limit is input + output, so you basically have to know beforehand how many tokens the AI needs for its output.
Rate limiting is an issue. If you start chunking, for instance.
A LLM is not a proper API.
You need to do syntax checking on the answer.
Are all the fields present? Validation.
Are the answers of the right type (float/string/etc).
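The three checks above (syntax, presence of fields, types) can be sketched with just the standard library. This is only an illustration of the idea, not what the speaker built with langchain and pydantic; the `QuizQuestion` fields and the `parse_quiz_question` helper are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class QuizQuestion:
    # Hypothetical shape of one quiz question in the LLM's reply.
    question: str
    answers: list
    correct_index: int


REQUIRED_TYPES = {"question": str, "answers": list, "correct_index": int}


def parse_quiz_question(raw: str) -> QuizQuestion:
    """Turn an LLM reply into a QuizQuestion or raise ValueError."""
    # 1. Syntax check: is it json at all?
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid json: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a json object")
    for field, expected in REQUIRED_TYPES.items():
        # 2. Validation: are all the fields present?
        if field not in data:
            raise ValueError(f"missing field: {field}")
        # 3. Type check. bool is a subclass of int in python, so it is
        #    rejected explicitly.
        value = data[field]
        if not isinstance(value, expected) or isinstance(value, bool):
            raise ValueError(f"wrong type for field: {field}")
    if not 0 <= data["correct_index"] < len(data["answers"]):
        raise ValueError("correct_index does not point at an answer")
    return QuizQuestion(**{name: data[name] for name in REQUIRED_TYPES})


reply = '{"question": "2 + 2?", "answers": ["3", "4"], "correct_index": 1}'
parsed = parse_quiz_question(reply)
print(parsed.answers[parsed.correct_index])
```

Raising one clear exception type makes it easy to retry the prompt instead of crashing.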
And hey, you can still write code yourself. You don’t have to ask the LLM everything, you can just do the work yourself, too. An open question is whether developers will start to depend too much on LLMs.
What is cancer? According to wikipedia: abnormal cell growth with the potential to invade or spread to other parts of the body. That is what you can observe. Medically, there are several aspects of cancer:
It prevents the cell from dying.
It can grab more than usual resources.
No sensitivity to the regular anti-growth signals.
AI is starting to be used in clinics. For instance for proton therapy: where to best apply the proton radiation. And in radiology: letting AI look at images to detect cancer. A good AI can out-perform doctors. Another use is the analysis of blood samples: trying to detect cancer based on the DNA samples in there.
DNA mutations can also be detected, which is what he focuses on. Cancer is basically a “disease of the genome”. DNA is made up of T, C, G and A sequences. Technically, it is perfectly feasible to “read” DNA.
How do mutations occur? Exposure can leave “scars” in DNA. Damage can occur due to sunlight or smoking for instance. Specific sources result in specific kinds of damage: smoking has a “preference” for changing specific letters. With analysis, you can thus detect/estimate the cause of cancer.
A method to detect it is non-negative matrix factorisation. Normally you can only summarize the data in “hard” clusters: something is either A or B. With this technique, you can do “soft” clusters: something can be a little bit A and a bit more B.
Matrix factorisation is a way to relate separate data sources. For movies, you can have persons with preferences for comedy or action movies. And movies with a percentage action/comedy. Combined you get a matrix with estimates for the preference for every movie per user.
In a similar way, he creates a matrix relating cancer causes (like smoking) to specific observed types of DNA damage.
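A small pure-python sketch of non-negative matrix factorisation, using the movie example. It uses the standard multiplicative update rules (Lee and Seung) because they fit in a few lines; that is my choice for illustration and not necessarily the method from the talk. The ratings matrix and all function names are made up.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    """Sum of squared differences between v and the product w * h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorise non-negative v (n x m) into w (n x rank) and h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # Update h: h * (w^T v) / (w^T w h)
        wt = transpose(w)
        numer, denom = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * numer[i][j] / (denom[i][j] + eps) for j in range(m)] for i in range(rank)]
        # Update w: w * (v h^T) / (w h h^T)
        ht = transpose(h)
        numer, denom = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * numer[i][j] / (denom[i][j] + eps) for j in range(rank)] for i in range(n)]
    return w, h


# Invented ratings: rows are persons, columns are movies. Two movies are
# mostly comedy, two mostly action; the last two persons like a mix.
RATINGS = [
    [5.0, 4.0, 1.0, 0.0],
    [0.0, 1.0, 4.0, 5.0],
    [2.5, 2.5, 2.5, 2.5],
    [4.0, 3.4, 1.6, 1.0],
]

w, h = nmf(RATINGS, rank=2)
error_start = squared_error(RATINGS, *nmf(RATINGS, rank=2, iterations=1))
error_end = squared_error(RATINGS, w, h)
print(error_start, error_end)
```

Each row of `w` tells how much a person leans towards each of the two "soft" clusters, and each row of `h` tells how strongly every movie belongs to a cluster. In the cancer setting the rows would be samples and the columns observed types of DNA damage.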
But… how reliable are the results? You can treat the matrix as a neural network. You can then use bayesian analysis to assess the probabilities.
He made a python package for his research: “mubelnet” (though I couldn’t find that online, btw).
AI is transforming cancer care. The only part it doesn’t affect is the actual nursing process.
Getting chatgpt to output valid json can be a chore:
> extract xxxx, output as json > extract xxxx, output as json list > extract xxxx, output as json with this schema > extract xxxx, output as json, aargh JSON I BEG YOU
Apparently they solved the json problem last monday. But he had the same problem when trying to get chatgpt to output only English and not Dutch. So the underlying problem is still there: you have to beg it to output in a certain way and hope it listens.
Some other problems are hallucinations: chatgpt telling you something with complete confidence, even though it is wrong. And biases. And it is not really a chatbot, as it doesn’t ask questions. Unparseable output. Lack of explainability. Privacy issues as you’re sending data to servers in the USA.
And… what are the data sources chatgpt used? We don’t know. They’re called “openAI”, but they’re definitely not open.
When to use LLMs and when not to use them. Some good use cases:
Zero/few shot learning. A quick way to get a simple minimum viable product or proof of concept.
Data format transformation. html to json for instance.
You can use it to gather training data for easy bootstrapping.
Some bad use cases:
Structured classification tasks. You really want proper, neat output. Especially when you have lots of classes or a big context. For small personal projects it might be OK, but not for production.
Non-text classification… A large language model of course won’t help you with it.
When costs or energy consumption is important. Scaling is an issue.
When it is unclear who is responsible for what gets outputted. A chatbot generating “of course, you can get a refund” can be problematic if the customer really wants the refund they should not get…
When you really want to be sure you get the right answer.
What are some ideas you can look at?
gzip plus nearest-neighbour analysis. Compress texts and see how similar they
are. It is not perfect, but it is a neat trick.
“Bag of words” plus “random forest” (a classifier available in scikit-learn).
Embeddings and a classifier. A LLM is used to annotate a dataset and you can then extract the interesting data and work with it.
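The gzip trick from the list above fits in a few lines of standard-library python. A minimal sketch, assuming the usual "normalized compression distance" as the similarity measure and a 1-nearest-neighbour lookup; the example texts and labels are made up.

```python
import gzip


def compressed_size(text: str) -> int:
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: small when a and b share content."""
    size_a, size_b = compressed_size(a), compressed_size(b)
    size_both = compressed_size(a + " " + b)
    return (size_both - min(size_a, size_b)) / max(size_a, size_b)


def classify(text: str, labelled_examples) -> str:
    """Return the label of the example closest to ``text``."""
    closest = min(labelled_examples, key=lambda example: ncd(text, example[0]))
    return closest[1]


# Invented training examples.
EXAMPLES = [
    ("the restaurant serves pizza and pasta with tomato sauce and cheese", "food"),
    ("the python library parses html pages and extracts structured data", "tech"),
]

label = classify("this restaurant serves pasta with tomato sauce and cheese", EXAMPLES)
print(label)
```

If two texts share many phrases, compressing them together costs hardly more than compressing one of them, which is what the distance measures.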
What he thinks is important: keep humans in the loop. Prevent unwanted consequences. Add a preview step before sending stuff out into the world. Make classifications visible and allow corrections. Ask the user to label something if it is unclear. And don’t forget to audit the automatic classifications.
When all you have is a LLM, everything might start to look like a generative task. But don’t think like that. Who is going to use it? What is the actual problem? Spend some time thinking about it.
Alternative title: five reasons your boss doesn’t allow you to work on your LLM app idea.
Show of hands at the beginning. “Who has never used chatgpt”. I think I was the only one raising my hand :-) Lots of people are interested in it. According to google search queries, more people are interested in prompt engineering courses than in programming courses. Working in generative AI is a great work field at the moment.
Wijnand played a lot with it. He made a linkedin autoresponder, a whatsapp chatbot, a rap song generator, etc. To become enthusiastic about it he recommends checking out https://devday.openai.com/ .
There are several common drawbacks you can hear from your boss:
“Generative AI doesn’t comply with privacy laws”. Main reason: data is often hosted by big USA companies. Well, you can use azure in Europe. There are Dutch startups like Orquesta that help you pick the right ones. Complying with the GDPR is possible. You can also use local models.
“AI hallucinates and is unreliable”. He thinks it is mostly solved. Retrieval augmented generation is one of the methods you can look at. Or prompt chain techniques like manual validation prompts or enforcing explicit requirements.
“Too expensive”. Programmers are expensive and models also. So: look at smaller, cheaper models: you often don’t need the full chatgpt4. Use simpler prompts. Perhaps create your vectorisation once: then you can run your prompts practically for free. Oh, and chatgpt4 will drop its price by a factor of 3.
“The context window is too small”. (Chatgpt4 can consume bigger items since last monday, btw). Chunking/summarizing or vector embedding can also help. If you want it to write an entire course, you can give it the initial question and ask it to generate a summary. From the summary a table of contents and from the TOC the individual chapters.
“Merging genAI with regular tools is hard”. You can ask chatgpt to reply in
json. With the json output, you can then even feed it to javascript
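The chunking idea from the context-window objection in the list above can be sketched like this. It splits on words as a rough stand-in for tokens, which is an assumption: a real application would count tokens with the tokenizer of the model it uses. The `chunk_words` name and its parameters are hypothetical.

```python
def chunk_words(text: str, max_words: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most ``max_words`` words.

    Consecutive chunks share ``overlap`` words, so that a sentence cut in
    two still has some context in the next chunk.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk reached the end of the text
    return chunks


chunks = chunk_words("a b c d e f g", max_words=3, overlap=1)
print(chunks)
```

Every chunk can then be summarized separately, after which the summaries are combined in a final prompt.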
During the talk, he showed off a project he is working on. A combination of chatgpt4 and web scraping, switching back between the two of them.
The biggest challenge he sees is to create something that won’t be taken over by OpenAI. So don’t compete with it but complement OpenAI. It is very hard to compete with them as they’re moving so quickly…
The company they work for is called “explosion”, so what can go wrong? :-)
SpaCy (https://spacy.io/) is a library for natural language processing. You give it text documents and you get them back with annotations.
Spacy mostly works with a pipeline. You always start with a tokenizer, afterwards multiple optional steps and at the end the annotated document.
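To illustrate the pipeline idea (tokenizer first, then optional steps that each add annotations), here is a toy version in plain python. It is explicitly not spaCy's API or algorithm: the `Token` class, the regex tokenizer and the tiny lemma table are all made up for this sketch.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    annotations: dict = field(default_factory=dict)


def tokenize(text: str) -> list[Token]:
    """Toy tokenizer: "'s" and punctuation become separate tokens."""
    return [Token(piece) for piece in re.findall(r"'s|\w+|[^\w\s]", text)]


# A real lemmatiser uses rules and trained data, this is just a lookup.
TOY_LEMMAS = {"accepted": "accept"}


def lemmatize(tokens: list[Token]) -> list[Token]:
    """Optional pipeline step: annotate every token with a lemma."""
    for token in tokens:
        lowered = token.text.lower()
        token.annotations["lemma"] = TOY_LEMMAS.get(lowered, lowered)
    return tokens


def run_pipeline(text: str, steps) -> list[Token]:
    tokens = tokenize(text)  # always the first step
    for step in steps:       # then the optional steps, in order
        tokens = step(tokens)
    return tokens


doc = run_pipeline("Twitter's offer was accepted.", [lemmatize])
print([(token.text, token.annotations["lemma"]) for token in doc])
```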
A tokenizer splits up the text. The period at the end of a sentence
doesn’t belong to the last word, for instance, it is a separate
item. “Twitter’s” is also split into “twitter” and “‘s”. What comes out of the
tokenization process is a
Doc, which behaves as a list of
doc can be
A useful step: lemmatisation. The token
accepted is annotated with the lemma
accept. This makes later searching easier.
directors has the
Span classification is entity recognition. A token
Musk is recognised
as a “person”. The tokens
april in combination can be a
“date”. The recognised entities end up as
You can do document classification. Categories like “newswire” or “love letter” with an attached estimation (“80% chance this is a newswire”).
Some of the transformers work with AI. Several kinds of pre-trained data are available. What they themselves use is the Groningen meaning bank (GMB), developed by the university of Groningen. More than 10k English texts, mostly newspaper texts from the public domain. You can also look at https://github.com/explosion/curated-transformers .
Spacy has its own plugins to provide annotations, but you can also plug in
your own. It is configured through a
.ini file. A project can be seen
as a sort of “makefile” for running everything. Assets (=remote sources you
want to have downloaded), training data, what has to be run, the config, etc.
They showed a demo of how the whole system works. Looked nice and useful. You can play with the demo yourself: https://github.com/explosion/aiGrunn-2023
Compared to a LLM like chatgpt, at the moment targeted NLP often performs much better at classification.
“Branches are all you need: data versioning framework for machine learning”.
If you work with git and work with binary files, small changes give you a completely new copy. With a couple of changes, you quickly get a huge repository. Especially when you’re a machine learning
A solution could be an object store (like amazon s3). Name directories like versions, for instance. But quickly it becomes a mess. Oh, and which version in the object store matches the versioned model parameters in git? Aargh.
What is proper data versioning? The answer is git. That’s the only solution to keep track of everything. The core is to use branches. The branches effectively contain links to files stored in object storage. There are tools for it like mlflow. You tell mlflow to upload/download the data, from your config in git. An alternative is git lfs for large files.
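The "branches contain links to files in object storage" idea can be sketched as follows: the big file goes into the object store under its content hash, and only a tiny pointer file is committed to git. The pointer format, the function names and the directory layout are invented for this sketch; mlflow and git lfs each have their own formats.

```python
import hashlib
import tempfile
from pathlib import Path


def store_data(data_file: Path, object_store: Path, repo_dir: Path) -> Path:
    """Copy data into the object store, write a small pointer file for git."""
    content = data_file.read_bytes()
    digest = hashlib.sha256(content).hexdigest()
    object_store.mkdir(parents=True, exist_ok=True)
    (object_store / digest).write_bytes(content)
    pointer = repo_dir / (data_file.name + ".pointer")
    pointer.write_text(f"sha256:{digest}\nsize:{len(content)}\n")
    return pointer


def load_data(pointer: Path, object_store: Path) -> bytes:
    """Follow a pointer file back to the data and verify the hash."""
    digest = pointer.read_text().splitlines()[0].split(":", 1)[1]
    content = (object_store / digest).read_bytes()
    if hashlib.sha256(content).hexdigest() != digest:
        raise ValueError("object store content does not match the pointer")
    return content


# Demo with local directories standing in for git and the object store.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    repo = root / "repo"
    repo.mkdir()
    raw = root / "measurements.csv"
    raw.write_bytes(b"room,temperature\nA,20.5\n")
    pointer = store_data(raw, root / "object-store", repo)
    pointer_text = pointer.read_text()
    restored = load_data(pointer, root / "object-store")

print(pointer_text)
```

Because the pointer changes whenever the data changes, every git branch or tag refers to exactly one version of the data.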
The main branch is for the readme, the documentation, definition of the business problem, onboarding information. There’s no data or code in here.
Data branches. First raw. Data first ends up here and is never deleted. Branches point at specific versions/collections.
Development branches. This is a combination of code and data. But don’t change the data, only the code. Make sure you’re only developing in a dev branch, not in a data branch: you want to keep the two activities separated.
When finished, you can tag what you have.
Stable branches. For (re-)training and running tests.
Analysis branch. Mostly for comparing models, checking algorithms.
He has a demo at https://xethub.com/sdssio/branches-demo .
My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.