The first "pyutrecht" meetup in Amersfoort in the Netherlands. (Amersfoort is not the city of Utrecht, but it is in the similarly named province of Utrecht).
I gave a talk myself about treating your own laptop setup the way a proper programmer would: keep a git repo with a README explaining which programs you installed, an install script or makefile for installing certain tools, "dotfiles" for storing your config in git, etc. I haven't made a summary of my own talk. Here are the other three:
William works at Deliverect, the host of the meeting. Web scraping means extracting data from a website and parsing it into a more useful format, like translating a list of restaurants on a webpage into a structured list.
Web crawling is something different: that means following links and trying to download all the pages on a website.
Important: robots.txt. As a crawler or scraper you're supposed to read it as it tells you which user agents are allowed and which areas of the website are off-limits (or not useful).
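Python's standard library can check robots.txt rules for you. A minimal sketch, parsing a made-up robots.txt from a string instead of fetching one over HTTP (the rules and the "mybot" user agent are my own invention):

```python
from urllib import robotparser

# a made-up robots.txt, parsed from a string instead of fetched over HTTP
rules = """\
User-agent: *
Disallow: /admin/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("mybot", "https://example.com/admin/secret"))  # False
print(rp.can_fetch("mybot", "https://example.com/menu"))          # True
```

In real code you'd call `rp.set_url(".../robots.txt")` plus `rp.read()` to fetch the live file.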
Another useful file that is often available: /sitemap.xml. A list of URLs on the site that the site owner considers useful for scraping or crawling.
A handy trick: looking at the network tab when browsing the website. Are there any internal APIs that the javascript frontend uses to populate the page? Sometimes they are blocked from easy scraping, or they're difficult to access due to creative headers, authentication, cookies or session IDs.
A tip: beautifulsoup, a python library for extracting neat, structured content from an otherwise messy html page.
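A minimal beautifulsoup sketch (the HTML snippet and class names are made up; needs `pip install beautifulsoup4`):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# made-up example of a small restaurant list buried in a page
html = """
<ul class="restaurants">
  <li><a href="/r/1">Pizzeria Roma</a></li>
  <li><a href="/r/2">Sushi Time</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
# CSS selectors turn the messy page into a neat python list
names = [a.get_text() for a in soup.select("ul.restaurants a")]
print(names)  # ['Pizzeria Roma', 'Sushi Time']
```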
selenium is an alternative as it behaves much more like a regular web browser. So you can "click" a "next" button a couple of times in order to get a full list of items. Because selenium behaves like a real web browser, things like cookies and IDs in query parameters and headers just work. That makes it easier to work around many kinds of basic protection.
A microcontroller is a combination of cpu, memory and some interfaces to external ports. https://micropython.org is a version of python for such low-power devices.
He demoed python's prompt running on a Raspberry Pi Pico connected via micro-USB. And of course the mandatory let's-blink-the-onboard-LED programs. And then some other demos with more LEDs and servos. Nice.
A big advantage of micropython is that it doesn't care what processor you have. With C/C++ you specifically have to compile for the right kind of processor. With micropython you can just run your code anywhere.
You can use micropython in three ways:
He showed a couple of possible target microcontrollers. A note to myself about the ESP8266: limited support, use .mpy. I think I have a few of those at home for should-test-it-at-some-time :-) Some examples: Pi RP2040, ESP32, Teensy 4.1.
A problem: RAM is scarce on such chips and python is hungry... You can do some tricks like on-demand loading. Watch out when using an LCD graphic display: that easily takes 150 kB.
You have to watch out for the timing requirements of what you want to do. Steering a servo is fine, but "neopixel" LEDs for instance need a higher frequency of signals than micropython can produce on such a microcontroller. If you use a C library for it, it works (he showed a demo).
Erik works as a maintainer on the Graphene and strawberry-graphql projects.
Graphql is a query language for APIs. It is an alternative to the well-known REST method. With REST you often have to do multiple requests to get all the data you need. And the responses often contain more information than you actually need.
With graphql, you always start with a graphql schema. You can compare it a bit to an openapi document. The graphql schema specifies what you can request ("a Meetup has a name, description, list of talks, etc").
An actual query specifies what you want to get back as response. You can omit fields from the schema that you don't need. If you don't need "description", you leave it out. If you want to dive deeper into certain objects, you specify their fields.
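As an illustration, a query against a made-up Meetup schema could look like this (the field names are my own invention, not from the talk):

```graphql
query {
  meetup(id: 1) {
    name        # requested
    talks {     # dive deeper into nested objects
      title
    }
    # "description" exists in the schema but is simply left out
  }
}
```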
Strawberry is a graphql framework. It has integrations for django, sqlalchemy, pydantic and more. The schemas are defined with classes annotated with @strawberry.type and fields with python type hints. (It looked neat!)
He showed a live demo, including the browser-based query interface bundled with graphql.
Note: strawberry is the more modern project (type hints and so on) and will eventually get all the functionality of graphene. So if strawberry's functionality is enough for you, use that one.
(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).
"Everybody" uses stackoverflow. Now lots of people use chatgpt (or chatgpt plus). Stackoverflow traffic has dropped by 50% in the last 1.5 year. So chatgpt can be your coding buddy.
He really likes it for quickly getting something working (MVP). Like writing something that talks to a magento API (a webshop system). It would take him ages to figure it all out. Or he could ask chatgpt.
He also thinks you don't need docstrings anymore: you can just ask chatgpt to explain a snippet of code for you. (Something I myself don't agree with, btw).
(He demoed some chatgpt code generation of a sample website). What he learned:
Some dangers:
(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).
The world of education was a bit shocked by chatgpt. The institution they work for advises being a bit careful, but allows it. But you're not allowed to have chatgpt write parts of your official thesis, just like you're not allowed to have a family member write it. Chatgpt usage can be treated as fraud.
Well, which tools can be used to search for possible fraud?
They looked at 40 student reports from a variety of fields, both Dutch and English, and from between January 2020 and June 2022, so before chatgpt could have been used. For every report, they made three summaries:
So: 120 test samples in total. In the end copyleaks performed the best. The others didn't do well.
(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).
LLM models can be huge. Mind-boggling huge. But... we can also have fun with small models.
He works at a company that regulates climate installations in buildings (HVAC: heating, ventilation, air conditioning) via the cloud. Buildings use 30% of all energy worldwide. So improving how the HVAC installation is used has a big impact.
A use case: normally you pre-heat rooms so that it is comfy when you arrive. But sometimes the sun quickly warms the room anyway shortly afterwards. Can you not conserve some energy without sacrificing too much comfort?
You could try to calculate an optimal solution, but they instead went for "just" measuring every individual room in combination with an AI.
Technical setup:
They have a server with 1 GPU, which is enough for training all those models!
(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).
Arjan is known for his programming videos.
Alternative title: "the dark side of integrating an LLM (large language model) in your software". You run into several challenges. He illustrates it with https://www.learntail.com/ , something he helped build. It creates quizzes from a text to make the reader more active.
He used the python library langchain to connect his app with an LLM. A handy trick: you can have it send extra format instructions to chatgpt based on a pydantic model. If it works, it works. But if you don't get proper json back, it crashes.
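That crash mode is easy to reproduce. A minimal defensive sketch (`ask_llm` is a made-up stand-in for the real langchain call, not learntail's actual code):

```python
import json

def ask_llm(prompt):
    # made-up stand-in for the real LLM call; models often wrap
    # the json you asked for in chatty prose
    return 'Sure! Here is your quiz: {"question": "2+2?", "answer": 4} Enjoy!'

def parse_quiz(reply):
    # don't assume the reply is bare json: grab the first {...} span
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in LLM reply")
    return json.loads(reply[start:end + 1])

quiz = parse_quiz(ask_llm("make a quiz about arithmetic"))
print(quiz["answer"])  # 4
```

This only papers over the simplest failure; malformed json inside the braces still needs a retry or a fallback.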
Some more challenges:
An LLM is not a proper API.
And hey, you can still write code yourself. You don't have to ask the LLM everything, you can just do the work yourself, too. An open question is whether developers will start to depend too much on LLMs.
(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).
What is cancer? According to wikipedia: abnormal cell growth with the potential to invade or spread to other parts of the body. That is what you can observe. Medically, there are several aspects of cancer:
AI is starting to be used in clinics. For instance for proton therapy: where to best apply the proton radiation. And in radiology: letting AI look at images to detect cancer. A good AI can out-perform doctors. Analysis of blood samples, trying to detect cancer based on the DNA samples in there.
DNA mutations can also be detected, which is what he focuses on. Cancer is basically a "disease of the genome". DNA is made up of T, C, G and A sequences. Technically, it is perfectly feasible to "read" DNA.
How do mutations occur? Exposure can leave "scars" in DNA. Damage can occur due to sunlight or smoking for instance. Specific sources result in specific kinds of damage: smoking has a "preference" for changing specific letters. With analysis, you can thus detect/estimate the cause of cancer.
A method to detect it is non-negative matrix factorisation. Normally you can only summarize the data in "hard" clusters: something is either A or B. With this technique, you can do "soft" clusters: something can be a little bit A and a bit more B.
Matrix factorisation is a way to relate separate data sources. For movies, you can have persons with preferences for comedy or action movies. And movies with a percentage action/comedy. Combined you get a matrix with estimates for the preference for every movie per user.
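A tiny numpy sketch of that movie example, using the classic multiplicative-update rules for non-negative matrix factorisation (the ratings matrix is made up; this is a toy illustration, not his actual code):

```python
import numpy as np

rng = np.random.default_rng(42)

# made-up user x movie rating matrix (rows: users, columns: movies)
V = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 5.0, 1.0, 0.0],
              [0.0, 1.0, 5.0, 4.0],
              [1.0, 0.0, 4.0, 5.0]])

k = 2  # two latent factors, e.g. "action" vs "comedy"
W = rng.random((4, k)) + 0.1   # user -> factor preferences ("soft" clusters)
H = rng.random((k, 4)) + 0.1   # factor -> movie weights

err_before = np.linalg.norm(V - W @ H)
eps = 1e-9
for _ in range(200):
    # multiplicative updates keep W and H non-negative by construction
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
err_after = np.linalg.norm(V - W @ H)
print(err_after < err_before)  # True
```

Each row of W then says how much a user belongs to each factor: a little bit A and a bit more B, the "soft" clustering from the talk.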
In a similar way, he creates a matrix relating cancer causes (like smoking) to specific observed types of DNA damage.
But... how reliable are the results? You can treat the matrix as a neural network. You can then use bayesian analysis to assess the probabilities.
He made a python package for his research: "mubelnet" (though I couldn't find that online, btw).
AI is transforming cancer care. The only part it doesn't affect is the actual nursing process.
(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).
Getting chatgpt to output valid json can be a chore:
> extract xxxx, output as json
> extract xxxx, output as json list
> extract xxxx, output as json with this schema
> extract xxxx, output as json, aargh JSON I BEG YOU
Apparently they solved the json problem last Monday. But he had the same problem when trying to get chatgpt to output only English and not Dutch. So the underlying problem is still there: you have to beg it to output in a certain way and hope it listens.
Some other problems are hallucinations: chatgpt telling you something with complete confidence, even when it's wrong. And biases. And it is not really a chatbot, as it doesn't ask questions. Unparseable output. Lack of explainability. Privacy issues as you're sending data to servers in the USA.
And... what are the data sources chatgpt used? We don't know. They're called "openAI", but they're definitely not open.
When to use LLMs and when not to use them. Some good use cases:
Some bad use cases:
What are some ideas you can look at?
What he thinks is important: keep humans in the loop. Prevent unwanted consequences. Add a preview step before sending stuff out into the world. Make classifications visible and allow corrections. Ask the user to label something if it is unclear. And don't forget to audit the automatic classifications.
When all you have is a LLM, everything might start to look like a generative task. But don't think like that. Who is going to use it? What is the actual problem? Spend some time thinking about it.
(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).
Alternative title: five reasons your boss doesn't allow you to work on your LLM app idea.
Show of hands at the beginning: "Who has never used chatgpt?" I think I was the only one raising my hand :-) Lots of people are interested in it. According to google search queries, more people are interested in prompt engineering courses than in programming courses. Generative AI is a great field to work in at the moment.
Wijnand played a lot with it. He made a linkedin autoresponder, a whatsapp chatbot, a rap song generator, etc. To become enthusiastic about it, he recommends checking out https://devday.openai.com/ .
There are several common drawbacks you can hear from your boss:
During the talk, he showed off a project he is working on. A combination of chatgpt4 and web scraping, switching back and forth between the two.
The biggest challenge he sees is to create something that won't be taken over by OpenAI. So don't compete with OpenAI, complement it. It is very hard to compete with them as they're moving so quickly...
(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).
The company they work for is called "explosion", so what can go wrong? :-)
SpaCy (https://spacy.io/) is a library for natural language processing. You give it text documents and you get them back with annotations.
Spacy mostly works with a pipeline. You always start with a tokenizer, then come multiple optional steps, and at the end you get the annotated document.
A tokenizer splits up the text. The period at the end of a sentence doesn't belong to the last word, for instance: it is a separate item. "Twitter's" also is "twitter" and "'s". What comes out of the tokenization step is a Doc, which behaves as a list of tokens. doc[9] can be 's.
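A toy illustration of that splitting behaviour with a plain regex (this is not spaCy's actual tokenizer, just the idea):

```python
import re

def toy_tokenize(text):
    # split off possessive 's and punctuation as separate tokens,
    # similar in spirit to what a real tokenizer does
    return re.findall(r"'s|\w+|[^\w\s]", text)

print(toy_tokenize("Twitter's CEO resigned."))
# ['Twitter', "'s", 'CEO', 'resigned', '.']
```

The real thing handles vastly more edge cases (abbreviations, URLs, emoji) via per-language rules.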
A useful step: lemmatisation. The token accepted is annotated with the lemma accept. This makes later searching easier. directors has the lemma director.
Span classification is entity recognition. A token Musk is recognised as a "person". The tokens 25 and april in combination can be a "date". The recognised entities end up as doc.ents[number].
You can do document classification. Categories like "newswire" or "love letter" with an attached estimation ("80% chance this is a newswire").
Some of the transformers work with AI. Several kinds of pre-trained data are available. What they themselves use is the Groningen meaning bank (GMB), developed by the University of Groningen. More than 10k English texts, mostly newspaper texts from the public domain. You can also look at https://github.com/explosion/curated-transformers .
Spacy has its own plugins to provide annotations, but you can also plug in your own. It is configured through an INI-style config file. A project can be seen as a sort of "makefile" for running everything. Assets (=remote sources you want to have downloaded), training data, what has to be run, the config, etc.
They showed a demo of how the whole system works. Looked nice and useful. You can play with the demo yourself: https://github.com/explosion/aiGrunn-2023
Compared to a LLM like chatgpt, at the moment targeted NLP often performs much better at classification.
(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).
"Branches are all you need: data versioning framework for machine learning".
If you work with git and work with binary files, small changes give you a completely new copy. With a couple of changes, you quickly get a huge repository. Especially when you're a machine learning team with large datasets and model files.
A solution could be an object store (like amazon s3). Name directories like versions, for instance. But quickly it becomes a mess. Oh, and which version in the object store matches the versioned model parameters in git? Aargh.
What is proper data versioning? The answer is git: that's the only solution that keeps track of everything. The core idea is to use branches. The branches effectively contain links to files stored in object storage. There are tools for this like mlflow: you tell mlflow to upload/download the data, based on your config in git. An alternative is git lfs for large files.
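For context, git lfs works exactly that way: the file committed to git is a tiny pointer, and the real content lives in a separate store. A pointer file looks like this (the oid shown is just the sha256 of an empty file, for illustration):

```
version https://git-lfs.github.com/spec/v1
oid sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
size 0
```

Diffing branches then means diffing these small pointers, while the binaries stay in object storage.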
He has a demo at https://xethub.com/sdssio/branches-demo .