The first "pyutrecht" meetup in Amersfoort in the Netherlands. (Amersfoort is not the city of Utrecht, but it is in the similarly named province of Utrecht).
I gave a talk myself about treating your own laptop setup more like a proper programming project. Have a git repo with a README explaining which programs you installed. An install script or makefile for installing certain tools. "Dotfiles" for storing your config in git. Etc. I haven't made a summary of my own talk. Here are the other three:
William works at Deliverect, the host of the meeting. Webscraping means extracting data from a website and parsing it into a more useful format, like translating a list of restaurants on a webpage into structured data.
There's a difference with web crawling: that is following links and trying to download all the pages on a website.
Important: robots.txt. As a crawler or scraper you're supposed to read it as it tells you which user agents are allowed and which areas of the website are off-limits (or not useful).
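Python's standard library can read robots.txt for you. A minimal sketch, with a made-up robots.txt (the user agents and paths are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: which user agents are allowed where.
robots_txt = """\
User-agent: *
Disallow: /admin/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A generic scraper may fetch normal pages, but should skip /admin/.
print(parser.can_fetch("*", "https://example.com/restaurants"))  # True
print(parser.can_fetch("*", "https://example.com/admin/stats"))  # False
# BadBot is banned from the whole site.
print(parser.can_fetch("BadBot", "https://example.com/"))        # False
```

In real use you would point `set_url()` at the live site's /robots.txt and call `read()` instead of parsing an inline string.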
Another useful file that is often available: /sitemap.xml. A list of URLs in the site that the site thinks are useful for scraping or crawling.
A handy trick: looking at the network tab when browsing the website. Are there any internal APIs that the javascript frontend uses to populate the page? Sometimes they are blocked from easy scraping or they're difficult to access due to creative headers or authentication or cookies or session IDs.
A tip: beautifulsoup, a python library for extracting neat, structured content from an otherwise messy html page.
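Beautifulsoup itself wasn't shown in detail, but the parse-then-extract idea can be sketched with just the stdlib's html.parser (with bs4 this collapses to something like `soup.find_all("li")`; the page fragment below is made up):

```python
from html.parser import HTMLParser

class RestaurantParser(HTMLParser):
    """Collect the text of every <li> item from a messy html page."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item and data.strip():
            self.names.append(data.strip())

# A hypothetical scraped page fragment.
html = "<ul><li> Cafe Utrecht </li><li>Pizzeria Roma</li></ul>"
parser = RestaurantParser()
parser.feed(html)
print(parser.names)  # ['Cafe Utrecht', 'Pizzeria Roma']
```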
selenium is an alternative, as it behaves much more like a regular web browser. So you can "click" a "next" button a couple of times in order to get a full list of items. Because selenium behaves like a real web browser, things like cookies and IDs in query parameters and headers just work. That makes it easier to work around many kinds of basic protection.
A microcontroller is a combination of cpu, memory and some interfaces to external ports. https://micropython.org is a version of python for such low-power devices.
He demoed python's prompt running on a Raspberry Pi Pico connected via microUSB. And of course the mandatory let's-blink-the-onboard-LED programs. And then some other demos with more LEDs and servos. Nice.
A big advantage of micropython is that it doesn't care what processor you have. With C/C++ you specifically have to compile for the right kind of processor. With micropython you can just run your code anywhere.
You can use micropython in three ways:
He showed a couple of possible target microcontrollers. A note to myself about the ESP8266: limited support, use .mpy. I think I have a few of those at home for should-test-it-at-some-time :-) Some examples: Pi RP2040, ESP32, Teensy 4.1.
A problem: RAM is scarce in such chips and python is hungry... You can do some tricks like on-demand loading. Watch out when using an LCD graphic display, that takes 150kb easily.
You have to watch out for the timing requirements of what you want to do. Steering a servo is fine, but "neopixel" LEDs for instance need a higher signal frequency than micropython can manage on such a microcontroller. If you use a C library for it, it works (he showed a demo).
Erik works as maintainer on the Graphene and the strawberry-GraphQL projects.
Graphql is a query language for APIs. It is an alternative to the well-known REST method. With REST you often have to do multiple requests to get all the data you need. And the answers will often contain more information than you actually need.
With graphql, you always start with a graphql schema. You can compare it a bit to an openapi document. The graphql schema specifies what you can request ("a Meetup has a name, description, list of talks, etc").
An actual query specifies what you want to get back as response. You can omit fields from the schema that you don't need. If you don't need "description", you leave it out. If you want to dive deeper into certain objects, you specify their fields.
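For example, a hypothetical query against the Meetup schema sketched above might look like this (the field names are made up); the response contains exactly these fields and nothing more:

```
{
  meetup(name: "PyUtrecht") {
    name
    talks {
      title
      speaker
    }
  }
}
```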
Strawberry is a graphql framework. It has integrations for django, sqlalchemy, pydantic and more. The schemas are defined with classes annotated with @strawberry.type and fields with python type hints. (It looked neat!)
He showed a live demo, including the browser-based query interface bundled with graphql.
Note: strawberry is the more modern project (type hints and so on) and will later get all the functionality of graphene. So if strawberry's functionality is enough for you, you should use that one.
(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).
"Everybody" uses stackoverflow. Now lots of people use chatgpt (or chatgpt plus). Stackoverflow traffic has dropped by 50% in the last year and a half. So chatgpt can be your coding buddy.
He really likes it for quickly getting something working (MVP). Like writing something that talks to a magento API (a webshop system). It would take him ages to figure it all out. Or he could ask chatgpt.
He also thinks you don't need docstrings anymore: you can just ask chatgpt to explain a snippet of code for you. (Something I myself don't agree with, btw).
(He demoed some chatgpt code generation of a sample website). What he learned:
Some dangers:
(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).
LLM models can be huge. Mind-boggling huge. But... we can also have fun with small models.
He works at a company that regulates climate installations in buildings (HVAC: heating, ventilation, air conditioning) via the cloud. Buildings use 30% of all energy worldwide. So improving how the HVAC installation is used has a big impact.
A use case: normally you pre-heat rooms so that it is comfy when you arrive. But sometimes the sun quickly warms the room anyway shortly afterwards. Can you not conserve some energy without sacrificing too much comfort?
You could calculate an optimal solution, but instead they "just" measure every individual room and combine that with an AI model.
Technical setup:
They have a server with 1 GPU, which is enough for training all those models!
(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).
Arjan is known for his programming videos.
Alternative title: "the dark side of integrating a LLM (large language model) in your software". You run into several challenges. He illustrated them with https://www.learntail.com/ , something he helped build. It creates quizzes from text to make the reader more active.
He used the python library langchain to connect his app to a LLM. A handy trick: you can have it send extra format instructions to chatgpt based on a pydantic model. If it works, it works. But if you don't get proper json back, it crashes.
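The talk used langchain's pydantic-based format instructions; the failure mode itself can be sketched with just the stdlib (the Quiz model and the replies below are made up):

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class Quiz:
    question: str
    answer: str

def parse_quiz(llm_reply: str) -> Optional[Quiz]:
    """Return a Quiz, or None when the model didn't produce proper json."""
    try:
        data = json.loads(llm_reply)
        return Quiz(question=data["question"], answer=data["answer"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return None

good = '{"question": "What is 2+2?", "answer": "4"}'
bad = "Sure! Here is your quiz: the question is..."
print(parse_quiz(good))  # Quiz(question='What is 2+2?', answer='4')
print(parse_quiz(bad))   # None: the model ignored the format instructions
```

The point is the except branch: an LLM reply is free text, so your code needs an explicit plan for the case where the "json" isn't json.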
Some more challenges:
A LLM is not a proper API.
And hey, you can still write code yourself. You don't have to ask the LLM everything, you can just do the work yourself, too. An open question is whether developers will start to depend too much on LLMs.
(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).
What is cancer? According to wikipedia: abnormal cell growth with the potential to invade or spread to other parts of the body. That is what you can observe. Medically, there are several aspects of cancer:
AI is starting to get used in clinics. For instance for proton therapy: determining where best to apply the proton radiation. And in radiology: letting AI look at images to detect cancer. A good AI can out-perform doctors. Also analysis of blood samples, trying to detect cancer based on the DNA samples in there.
DNA mutations can also be detected, which is what he focuses on. Cancer is basically a "disease of the genome". DNA is made up of T, C, G and A sequences. Technically, it is perfectly feasible to "read" DNA.
How do mutations occur? Exposure can leave "scars" in DNA. Damage can occur due to sunlight or smoking for instance. Specific sources result in specific kinds of damage: smoking has a "preference" for changing specific letters. With analysis, you can thus detect/estimate the cause of cancer.
A method to detect it is non-negative matrix factorisation. Normally you can only summarize the data in "hard" clusters: something is either A or B. With this technique, you can do "soft" clusters: something can be a little bit A and a bit more B.
Matrix factorisation is a way to relate separate data sources. For movies, you can have persons with preferences for comedy or action movies. And movies with a percentage action/comedy. Combined you get a matrix with estimates for the preference for every movie per user.
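With made-up numbers, the movie example boils down to a matrix product: a users × genres matrix times a genres × movies matrix gives a per-user estimate for every movie.

```python
# Made-up taste profiles: how much each person likes [action, comedy].
users = {"ann": [0.9, 0.1], "bob": [0.2, 0.8]}
# Made-up genre mix per movie: fraction [action, comedy].
movies = {"Die Hard": [1.0, 0.0], "Airplane!": [0.1, 0.9]}

def estimate(user: str, movie: str) -> float:
    """Dot product of user taste and movie genre mix."""
    return sum(u * m for u, m in zip(users[user], movies[movie]))

# The full matrix of estimates: one number per user per movie.
for user in users:
    for movie in movies:
        print(user, movie, round(estimate(user, movie), 2))
```

Matrix factorisation runs this in reverse: given only the observed preference matrix, it estimates the two factor matrices.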
In a similar way, he creates a matrix relating cancer causes (like smoking) to specific observed types of DNA damage.
But... how reliable are the results? You can treat the matrix as a neural network. You can then use bayesian analysis to assess the probabilities.
He made a python package for his research: "mubelnet" (though I couldn't find that online, btw).
AI is transforming cancer care. The only part it doesn't affect is the actual nursing process.
(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).
Getting chatgpt to output valid json can be a chore:
> extract xxxx, output as json
> extract xxxx, output as json list
> extract xxxx, output as json with this schema
> extract xxxx, output as json, aargh JSON I BEG YOU
Apparently they solved the json problem last monday. But he had the same problem when trying to get chatgpt to output only English and not Dutch. So the underlying problem is still there: you have to beg it to output in a certain way and hope it listens.
Some other problems are hallucinations: chatgpt telling you something with complete confidence, even when it is wrong. And biases. And it is not really a chatbot, as it doesn't ask questions. Unparseable output. Lack of explainability. Privacy issues, as you're sending data to servers in the USA.
And... what are the data sources chatgpt used? We don't know. They're called "openAI", but they're definitely not open.
When to use LLMs and when not to use them. Some good use cases:
Some bad use cases:
What are some ideas you can look at?
What he thinks is important: keep humans in the loop. Prevent unwanted consequences. Add a preview step before sending stuff out into the world. Make classifications visible and allow corrections. Ask the user to label something if it is unclear. And don't forget to audit the automatic classifications.
When all you have is a LLM, everything might start to look like a generative task. But don't think like that. Who is going to use it? What is the actual problem? Spend some time thinking about it.
(One of my summaries of the 2023 Dutch aiGrunn AI conference in Groningen, NL).
The company they work for is called "explosion", so what can go wrong? :-)
SpaCy (https://spacy.io/) is a library for natural language processing. You give it text documents and you get them back with annotations.
Spacy mostly works with a pipeline. You always start with a tokenizer, then multiple optional steps, and at the end you get the annotated document.
A tokenizer splits up the text. The period at the end of a sentence doesn't belong to the last word, for instance: it is a separate item. "Twitter's" also becomes "twitter" and "'s". What comes out of the tokenization process is a Doc, which behaves as a list of tokens. doc[9] can be 's.
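SpaCy's real tokenizer is rule-based and language-aware; a crude regex sketch of the splitting behaviour described above (the sentence is made up):

```python
import re

def crude_tokenize(text: str) -> list:
    # Split off possessive 's and punctuation as separate tokens,
    # roughly mimicking the spaCy behaviour described above.
    return re.findall(r"'s|\w+|[^\w\s]", text)

print(crude_tokenize("Twitter's CEO resigned."))
# ['Twitter', "'s", 'CEO', 'resigned', '.']
```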
A useful step: lemmatisation. The token accepted is annotated with the lemma accept. This makes later searching easier. directors has the lemma director.
Span classification is entity recognition. A token Musk is recognised as a "person". The tokens 25 and april in combination can be a "date". The recognised entities end up as doc.ents[number].
You can do document classification. Categories like "newswire" or "love letter" with an attached estimation ("80% chance this is a newswire").
Some of the transformers work with AI. Several kinds of pre-trained data are available. What they themselves use is the Groningen meaning bank (GMB), developed by the university of Groningen. More than 10k English texts, mostly newspaper texts from the public domain. You can also look at https://github.com/explosion/curated-transformers .
Spacy has its own plugins to provide annotations, but you can also plug in your own. It is configured through a .ini file. A project can be seen as a sort of "makefile" for running everything. Assets (=remote sources you want to have downloaded), training data, what has to be run, the config, etc.
They showed a demo of how the whole system works. Looked nice and useful. You can play with the demo yourself: https://github.com/explosion/aiGrunn-2023
Compared to a LLM like chatgpt, at the moment targeted NLP often performs much better at classification.
(One of my summaries of the 2023 Dutch pythonconferentie python meeting in Utrecht, NL).
There's deep magic in python. Some of it is really hard to understand. Guido van Rossum: "from day one, there was deep magic hiding in some places, designed to quietly help users". Sebastiaan is going to show us a nice example of python magic: how "self" gets injected into method calls on a class:
```python
class Guitar:
    def __init__(self, name):
        self.name = name

    def play_note(self, note):
        print(f"{self.name} plays {note}")
```
You can instantiate a guitar and call my_guitar.play_note("B") on it. With just the note as a parameter. But where is the self coming from? Somehow your instance of the Guitar class is magically inserted.
Note: self is not a magical keyword. You can give it a different name and it will still work (though your code will be unreadable and people will hate you).
Everything in python is an object. A class is an object (an instantiation of Type). A method inside a class is a regular function object. Plus it is also added with the name of the function as an attribute to the class (the namespace of the class).
If you call my_guitar.play_note("B"), your my_guitar is just an object. It does not have the attribute pointing at the function in its namespace. But it does find the function, as python also looks up attributes in the class of an object: this is what "self" comes from.
You can see it when you look at the method: Guitar.play_note says it is a "function". my_guitar.play_note says it is a "bound method".
The magic that is happening is done with "dunder methods":
Functions all have a __get__() method. When accessed on a class, you'll get the plain function back. When accessed on an instance, it returns a bound method: the function bound to the instance, with the instance passed as the first argument.
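A quick way to see this in action, using the Guitar class from above (a "Fender" instance is made up for the demo):

```python
class Guitar:
    def __init__(self, name):
        self.name = name

    def play_note(self, note):
        return f"{self.name} plays {note}"

my_guitar = Guitar("Fender")

# On the class, attribute lookup just returns the plain function...
print(type(Guitar.play_note).__name__)     # function
# ...on the instance, the descriptor protocol binds it to my_guitar.
print(type(my_guitar.play_note).__name__)  # method

# Calling the function's __get__ by hand does the same binding:
bound = Guitar.__dict__["play_note"].__get__(my_guitar, Guitar)
print(bound("B"))                          # Fender plays B
```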
There are some decorators that are related to this: @staticmethod, @classmethod and @property.
These "descriptors" allow you to customize how attributes work.
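A minimal custom descriptor, to show the same __get__/__set__ mechanism put to your own use (the Uppercase example is made up):

```python
class Uppercase:
    """Descriptor that stores a value but always returns it uppercased."""

    def __set_name__(self, owner, name):
        self.attr = "_" + name

    def __get__(self, obj, objtype=None):
        return getattr(obj, self.attr).upper()

    def __set__(self, obj, value):
        setattr(obj, self.attr, value)

class Band:
    name = Uppercase()

band = Band()
band.name = "the beatles"
print(band.name)  # THE BEATLES
```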
Book recommendation: "fluent python".
(One of my summaries of the 2023 Dutch pythonconferentie python meeting in Utrecht, NL).
Jodie wants to look at large language models (like chatgpt), which went into full-on hype mode this year. Let's look at some historical examples of what people have thought to be artificial intelligence.
Now chatGPT: people think it displays real artificial general intelligence. But what is the reality? Can we look at it more scientifically? A well known article ("sparks of artificial general intelligence") claims to use categories from an older article to rank it: reasoning, planning, problem solving, abstract thinking, comprehending complex ideas, learning quickly and from experience.
Only... Jodie has lots of experience in psychology and those are not categories that are used to gauge intelligence. And the older article also couldn't be found.
A common problem with artificial intelligence is that it is only considered intelligent until you explain it. When we know how a machine does something intelligent, it ceases to be regarded as intelligent.
Another problem is that artificial intelligence is often very focused and goal-oriented. It performs impressively on one specific task and totally not on others. Don't give a math problem to chatgpt...
There are several levels of intelligence:
As a human, you're generally intelligent. You can learn several broad abilities. Which allows you to accomplish tasks. So there are several levels.
General intelligence maps to the extreme generalisation level. Broad maps to broad. No/local generalisation to tasks. This is a good way to think about AI, too.
So if you look at tasks: generalisation is difficult. How many ways are there of solving the task? How many examples are there? How much experience do you need? How high is the value of achieving intelligence? That can be a way of determining the intelligence at the task.
Learning in AI is often done through brute force: lots and lots of examples. If a problem is too far outside of the original training set... Chatgpt hasn't exactly revealed what it was trained on, but it is at least trained on data from 2021 and earlier. When asked to solve programming puzzles that were available on the internet in 2021, chatgpt had a 100% score. "So we don't need programmers anymore".
But when asked to solve puzzles from 2022, it failed miserably... When asked, chatgpt even said it got the answers from the pre-2021 data.
A better question than "is this real general AI" is "where can this realistically be used". And programmer AI tools like copilot are one of the better use cases, actually. The added benefit is that there's quite some extra validation you can do on the output (code syntax checkers, the interpreter, etc.)
(One of my summaries of the 2023 Dutch pythonconferentie python meeting in Utrecht, NL).
She showed "the zen of python":
```python
>>> import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
```
She herself wondered whether she fully understood those. Are they suggestions? Aphorisms? The law? She turned to the book "Gödel, Escher, Bach". Mentioned in there are the three layers of any message:
Simple is better than complex: fine. But "complex is better than complicated"? What do those terms mean exactly? Complex is difficult to separate or to solve. Complicated is difficult to understand or explain. Aha.
Readability counts. Adequate variable/module naming. A well-designed structure. Linting. Flat is better than nested, sparse is better than dense. Documentation is important. She thanked Daniele Procida for his work on this.
Freedom. Freedom of will. Freedom of choice. Python is in a sense an exercise in freedom: how much freedom can we give developers? How much do we need to restrict to keep everything readable?
Viktor Frankl: the only thing you cannot take away from me is the way I choose to respond to what you do to me. Freedom.
Bias. We often think of ourselves as the center of the universe. If you clap, you always hear a sound, right? Well, not in a vacuum. We also think we see everything correctly. But there's a blind spot in your eye: the area where there are no light-sensitive receptors, as some connectors have to pass through on the way to the brain. The brain compensates for it by literally filling in the blank with bogus information...
We are biased. We have unrealistic expectations. How can we mitigate? A good start is to recognise it and to try to mitigate. Some strategies:
Funny! She was asked afterwards which line she would like to add to the Zen of Python :-) It was "Good enough is good". (I think it was followed by a next line, "... but refactor afterwards").