2025-03-27
(One of my summaries of the second Python Leiden (NL) meetup in Leiden, NL).
He’s working with civil engineering, hydrology and weather data. And… he wanted to toy with 3D models. His example site: https://topography.jessekv.com/ . You can click almost anywhere in the world and get the upstream catchment. (I checked it: yes, it seems to work pretty well!)
He runs all requests through a single asyncio python thread. As a personal challenge he wanted it to handle the heavy load of a hacker news post. In the end, it worked fine. One async python thread was more than enough.
One of the tricks he used was to preprocess as much as reasonable so that most clicks are direct lookups in a database (vector data). Depending on the size of the selected area, he uses more detailed rasters for small areas and coarser ones for big areas.
He wanted a mostly-working prototype quickly, so he experimented with LLMs. Generating math code was mostly bad, but the UI code was OK.
He used duckdb with its spatial extension, which uses GDAL’s vector routines. This is what he used to pre-process the catchment areas on his M1 mac laptop. Afterwards, he exported the result to postgres, which is much more optimised for actual production use.
Duckdb doesn’t always work perfectly, but if you’re able to define your workload in such a way (parallelised) that you stay within your memory limits, you can get really good performance out of it.
Duckdb’s file-based approach is also handy. Just like sqlite’s files. Easy for experimenting.
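To give an idea of what duckdb’s spatial extension looks like, a minimal sketch (the file name and the columns are invented, not his actual data):

import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial")
con.execute("LOAD spatial")
# ST_Read() uses GDAL under the hood to read vector files like geopackages.
result = con.execute(
    """
    SELECT name, ST_Area(geom) AS area
    FROM ST_Read('catchments.gpkg')
    WHERE ST_Contains(geom, ST_Point(4.49, 52.16))
    """
).fetchall()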
zarr is what he used for pre-processing the 3D landscape. Zarr is efficient for storing large arrays of gridded data. Zarr is way nicer than netcdf. It is designed to leverage the linux page cache. And you store compressed data in memory. Storing on S3 is also well-integrated.
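A minimal zarr sketch of the chunked, compressed storage idea (the shape and chunk sizes are invented):

import numpy as np
import zarr

# One big elevation grid, stored as separately-compressed 1000x1000 chunks.
z = zarr.open("elevation.zarr", mode="w",
              shape=(100_000, 100_000), chunks=(1_000, 1_000), dtype="f4")
z[0:1_000, 0:1_000] = np.random.random((1_000, 1_000))  # writes exactly one chunk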
jax is an easy way to take numpy/scipy code and make it run in parallel. JAX-metal is a jax backend that runs on his M1 macbook’s GPU. Processing is aligned to chunks for more efficient reads and writes. For landcover he used jax.scipy.stats.mode and for elevation jax.numpy.nanmean. (NaN, not-a-number: elevation models are made with radar technology and water areas reflect radar, resulting in NaN.)
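A tiny sketch of those two reductions (the arrays are made up):

import jax.numpy as jnp
from jax.scipy.stats import mode

landcover = jnp.array([1, 1, 2, 1, 3, 3])
elevation = jnp.array([10.0, jnp.nan, 12.0, 14.0])

print(mode(landcover))         # most frequent landcover class (1)
print(jnp.nanmean(elevation))  # NaN-aware mean elevation (12.0)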
A useful trick he used was to introduce a bit of wait time for some expensive operations to make sure his service wouldn’t get flooded with requests. Simply waiting a few seconds and then popping up a dialog “you’re going to download 400MB, y/n?” already helped.
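Roughly the idea, as a sketch (the numbers and names are invented):

import asyncio

async def handle_expensive_request(size_mb: int) -> str:
    # A deliberate delay: a flood of expensive requests gets slowed down.
    await asyncio.sleep(3)
    if size_mb > 100:
        return f"You're going to download {size_mb}MB, y/n?"
    return "ok, processing"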
2025-03-27
(One of my summaries of the second Python Leiden (NL) meetup in Leiden, NL).
uv is the new python packaging solution that everybody should be using. He demoed it in a standard demo django wagtail project that still had a requirements.txt. Creating a virtualenv and doing a pip install worked, but it took a bit of time.

Next, he tried the venv/pip compatibility layer of uv: so uv venv and uv pip install -r requirements.txt. Oh… python -m venv .venv took 3 seconds, and uv venv 0.04 seconds. Wow. The uv pip was also much faster. Wow.
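What the comparison boiled down to (timings as he quoted them):

python -m venv .venv                # ~3 seconds
uv venv                             # ~0.04 seconds
uv pip install -r requirements.txt  # drop-in pip replacement, much faster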
“Faster” is nice in development, but also when running your tests in a CI environment (like github actions).
uv can also manage your python installations. It downloads a stand-alone python for you when needed, for instance if you need a version you don’t have locally.
Lastly, he added a pyproject.toml and moved the dependencies from requirements.txt into it. Calling uv run manage.py automatically activates the virtualenv, installs everything and runs manage.py just as if you’d called python manage.py.
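Something like this minimal pyproject.toml (the names are guessed from the demo project, not his actual file):

[project]
name = "demo-project"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "django",
    "wagtail",
]

After which uv run manage.py runserver just works: virtualenv, installs and all.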
Installing it this way creates a uv.lock file with all the pinned packages, just as uv downloaded them. The file should be checked in. The dependencies only say “I want whitenoise and don’t care what version”, but the uv.lock tells all your colleagues (or your server…) which versions you’ve installed, keeping everything in sync.
Dependency scanners also support uv.lock. At least, “renovate” already supports it and “dependabot” support is underway.
Handy: uv tool install your-tool for installing some tool like “pre-commit” in its own virtualenv somewhere and making it available globally. uvx your-tool installs the tool temporarily for just one command. And with uv add --script your-script.py some-dependency you can add dependencies to single-file scripts. Running the script with uv run your-script.py installs those dependencies for the duration of the script.
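A quick sketch of those commands (the tool and dependency names are just examples):

uv tool install pre-commit      # permanently, in its own virtualenv
uvx pycowsay "moo"              # temporarily, for just this one command
uv add --script my-script.py requests
uv run my-script.py             # installs requests just for this run

The uv add --script variant writes the dependency as inline metadata into a comment block at the top of my-script.py, which uv run reads back.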
2025-03-27
(One of my summaries of the second Python Leiden (NL) meetup in Leiden, NL).
Nice subtitle for the talk: “python serialisation: this ain’t your father’s cruesli”… :-)
He wants to show us how dangerous it is to de-serialize content provided by someone else. His examples are at https://github.com/airza/deserialization_labs
Serialisation: converting a data structure living in memory into data you can store on disk or send over. Deserialisation is converting it back into a python object. There are interoperable formats like json and xml. Most languages have their own specific methods: python has pickle.
Serialising a dict or list is often easy: json.dumps({"some": "structure"}). But what if you’ve got some non-standard data structure like a python object? json serialisation won’t work out of the box. And if you’ve got huge data structures, json (being human-readable) is slow and huge.
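A small illustration (the Cat class is made up):

import json

class Cat:
    def __init__(self, fur):
        self.fur = fur

json.dumps({"some": "structure"})  # fine
json.dumps(Cat("Fluffy"))          # TypeError: Object of type Cat is not JSON serializable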
Pickle stores python objects in some binary format on disk. Fun fact: pickle was added to python in 1995, json only exists since 2006. I’ll paste one of his examples to make clear how pickle works:
import pickle
from tsukimi import Tsukimi
cat = Tsukimi("Fluffy", "Empty")
pickle.dump(cat, open("tsukimi.pickle", "wb"))
Deserialising works like this:
import pickle
cat = pickle.load(open('tsukimi.pickle', 'rb'))
print(cat.fur)
print(cat.brain)
Pickle just stores the name of the class it needs to re-create plus the contents of the attributes. So not the methods: those come from the class that gets looked up at load time.
Pickle is explained here: https://docs.python.org/3/library/pickle.html . It has a nice warning right at the top: Warning: The pickle module is not secure. Only unpickle data you trust.
Pickle stores all attributes by default. If you don’t want that, you can define a special __reduce__() method that returns the attributes you want plus the callable (normally your class) that can restore them. But… that callable is just looked up by name, there’s no validation. So you can also pass something that’s not your class, but something like os.system, which just runs anything on the command line…:
import os
import pickle

class EvilCat:
    def __reduce__(self):
        # "To restore this object, call os.system('export > version.txt')"
        return os.system, ('export > version.txt',)

evil = EvilCat()
pickle.dump(evil, open("evil.pickle", "wb"))
If the code that loads this pickle reads the version.txt (as in the exercise that he had us run), you suddenly see all the server’s environment variables: os.system ran at unpickling time.
So: never let people give you pickles. Use json for user input. Or protobuf.
Pytorch (a pydata library) uses pickles. They recently started overwriting the unpickler’s functionality, but he showed some ways to get around its “limitations”.
He recommended looking at https://github.com/b4rdia/HackTricks/tree/master/generic-methodologies-and-resources/python/bypass-python-sandboxes
2025-01-27
(One of my summaries of the first Python Leiden (NL) meetup in Leiden, NL).
Tobias studied applied mathematics at Delft University.
One of the fields he used python for was graph theory. A graph consists of points (“vertices”) connected by lines (“edges”). It is a large field with many real-world applications like social networks and logistics. He showed a demo he made with networkx, a python library that makes it really easy to do these kinds of graph calculations.
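A tiny networkx illustration of the kind of calculation he demoed (the graph itself is invented):

import networkx as nx

G = nx.Graph()
G.add_edges_from([("Alice", "Bob"), ("Bob", "Carol"), ("Alice", "Dave")])
print(nx.shortest_path(G, "Dave", "Carol"))  # ['Dave', 'Alice', 'Bob', 'Carol']
print(nx.degree_centrality(G))               # who is best-connected?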
Graphs also need to be shown. He used pyviz for that, by converting the networkx graph to the format understood by pyviz.
Another field is machine learning. He did an experiment with a simulated self-driving car. He used a library that handles the basics like “reacting to a solid line on the road” and “measuring the distance to the dashed line on the road”. The simulation is shown in visual form, which makes it fun to look at.
In his study, python was also handy for statistics and numerical analysis.
2025-01-27
(One of my summaries of the first Python Leiden (NL) meetup in Leiden, NL).
FawltyDeps is a python dependency checker. “Finding undeclared and unused dependencies in your notebooks and projects”.
Note by Reinout: since 2009 I’ve been one of the maintainers of z3c.dependencychecker… also a python dependency checker :-) So this talk interested me a lot, as I didn’t yet know about fawltydeps.
A big problem in science is the “replication crisis”. Lots of research cannot actually be reproduced when you try it… Data science is part of this problem. Reproducing your jupyter notebook for instance.
Someone looked at 22k+ jupyter notebooks. Only 70% had declared their dependencies, 46% could actually install the dependencies and only 5% could actually be run. ModuleNotFoundError and ImportError were number 1 and 3 in the list of exceptions!
What is a dependency? For instance “numpy”, if you have an import numpy as np in your file. Numpy isn’t in the python standard library, you have to install it first.

You can specify dependencies in setup.py, pyproject.toml, requirements.txt and so on. If you import something and don’t specify it, it is an “undeclared dependency”. When you later on remove an import and don’t adjust your requirements.txt, you have an “unused dependency”. That’s not immediately fatal, but it might take up unnecessary space.
FawltyDeps was started to help with this problem: find undeclared and unused dependencies. It reports them. You can ask for a more detailed report with line numbers where the dependencies were found.
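Basic usage looks something like this (I’m going from the documentation here, not from the talk):

pip install fawltydeps
fawltydeps              # reports undeclared and unused dependencies
fawltydeps --detailed   # also shows where the imports/declarations were found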
FawltyDeps supports most dependency declaration locations: requirements.txt, setup.py, pyproject, conda, etc. And it works with plain python files, notebooks, most python versions and most OSes. You can configure it on the commandline and in config files. There’s even a handy command to add an example config to your pyproject.toml.
Handy: you can add it as a pre-commit hook (https://pre-commit.com). And: there’s a ready-made github action for it, including good reporting.
Fawltydeps has to deal with several corner cases:
Package names that don’t match what you import: import sklearn versus the dependency scikit-learn.
Or setuptools, which provides both setuptools and pkg_resources.
For this it looks at various locations for installed packages to help figure out those mappings. It helps if you’ve installed FawltyDeps in your project’s virtualenv.
You can add your own custom mappings in your configuration to help FawltyDeps (see the config sketch after this list).
You can exclude directories.
There’s a default list of “tool” packages that FawltyDeps doesn’t complain about if you include them as dependency without importing them. Ruff, black, isort: those kinds of tools.
Django projects can have dependencies that aren’t actually imported anywhere. You can ignore those in the config to prevent them from being reported as unused.
At the moment, extra dependencies (like [test] or [dev] dependencies) are just handled as part of the whole set of dependencies.
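A hedged pyproject.toml configuration sketch based on the documentation (the package names are invented):

[tool.fawltydeps]
ignore_unused = ["django-extensions"]  # declared on purpose, never imported

[tool.fawltydeps.custom_mapping]
my-company-utils = ["company_utils"]   # package name -> import name(s)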
2025-01-16
Important things first: 27 January there’s a python meetup in Leiden (NL) of the new python Leiden user group.
Meetup groups come and go, often depending on one or two people or on a company that organises it. And yes, meetup.com has basically cornered the market, at least in my experience.
There used to be a “PUN”, python usergroup Nederland, meeting that would be held in various cities, depending on the company that hosted it in turn. (For those in NL: Den Haag, Zoetermeer, Rotterdam, Utrecht, Veenendaal, Arnhem, Amsterdam, …). Managed via a mailing list, as meetup.com didn’t exist yet. Later lots of python and/or django meetup.com-based meetups were organised. To me, it felt a bit weird that all of them seemed to be city-oriented. Amsterdam python meetup (3 different ones), Amsterdam django meetup, Utrecht (2x), Eindhoven, Rotterdam. I went to many of them, mostly it is just an hour of travel by public transport :-)
I like going to those meetups. You get a feel for what people are doing with python. You get ideas. You learn about libraries (sometimes even from the standard library) that you didn’t know about. New python tricks. For me, it is a great method to keep up-to-date on what’s possible and on what people are enthusiastic about.
At the moment, the number of python meetups in the Netherlands seems a bit low. Perhaps I’m missing something? (I see there’s a pydata one in Amsterdam that I missed.) The last two I attended were the nice PyUtrecht ones. So: I’ll be attending the Leiden one :-)
(Note: I’m talking about meetups, we’re blessed with two one-day python conferences in the Netherlands. pygrunn in May and pycon NL in October.)
2024-10-10
(One of my summaries of the one-day Pycon NL conference in Utrecht, NL).
Full title: Events in fintech: from state machines to event-sourced systems.
He works at kiwi.com, a payment processor. They recently refactored their systems to be more event-based.
Previously, they used state machines. A payment/customer/whatever can be in several well-defined states. Between states (but not necessarily between all states) there are transitions.
Advantages of state machines: they’re simple. Easy to understand. They’re efficient. Easy to test: just states and transitions.
Disadvantages: there’s no inherent history. That’s sometimes bad, as it is hard to answer “why is my payment in this state?!?” There’s also lack of context. And changing a state machine can be hard.
Some problems they encountered: race conditions. Two transitions are started for the same account. One gets a +15, the other a -20. As the transitions don’t know about each other, the resulting account balance can be incorrect.
Now on to events. Event sourcing / event driven architecture is what he called it. You start with a command “withdraw money, amount = 15”. This gets placed in a queue and gets processed. The processing results in another event, “15 has been removed”, that gets sent to the database.
Events are immutable facts about things that have happened in the system. They are always in the past tense and include all relevant data. And: avoid internal implementation details if possible.
“Event sourcing”: you can re-construct the current state of the system by looking at all the events. The current state has a source: the accumulation of all events. Great for having a complete audit trail. You can also reconstruct past states (“you have programmatic access to the past” :-) ). You also have the possibility to actually solve race conditions that occurred.
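The core idea in a few lines of python (my illustration, not his code):

from dataclasses import dataclass

@dataclass(frozen=True)
class Deposited:
    amount: int

@dataclass(frozen=True)
class Withdrawn:
    amount: int

def current_balance(events) -> int:
    # The current state is just the accumulation of all past events.
    balance = 0
    for event in events:
        balance += event.amount if isinstance(event, Deposited) else -event.amount
    return balance

history = [Deposited(15), Withdrawn(20)]
print(current_balance(history))  # -5, and the "why" is fully visible in the history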
If you have the time, you can even go towards the CQRS pattern (command query responsibility segregation).
There are challenges with event sourcing. Eventual consistency: if you encounter a problem, you can fix it. But the fix changes history, so the system is “eventually consistent”, not “really really always consistent”. Also: there are higher storage requirements. Complexity is higher too.
The challenges have possible solutions. The “saga pattern” has start and end events. You could also try some locking mechanism, or “optimistic concurrency control”. Not always possible, but handy: “idempotent events”, events that you can apply a couple of times after each other without doing any harm.
2024-10-10
(One of my summaries of the one-day Pycon NL conference in Utrecht, NL).
In 2024, what is a fullstack python dev? Well, python. Database administrator. Integrator with external services. A bit of terraform or kubernetes. AWS clicky-clicky expert. And a bit of frontend.
So the frontend is only a small part of your job. You have a package.json that makes no sense to you, so you just npm install it and download the entire internet into your local node_modules/ folder. But somehow it seems to work.
The frontend? You have frameworks. More frameworks. Meta-frameworks that framework your frameworks. Frameworks don’t make things simpler, they just move the complexity somewhere else. If you use django’s ORM, you don’t have to deal with the complexity of SQL. But in exchange you have to learn the django ORM.
He wants to look at three layers:
Markup / html.
Styling / css / design systems.
Interactivity / javascript.
Markup and html. On one hand you can have bindings. “Fasthtml”, for instance: a mapping from python to html. return Div(H1("hurray")). You just move the complexity.

Or “native web components”. You have custom <star-rating> elements that get translated into actual html. You need specific javascript code for this, so it isn’t really portable between frameworks.

Another alternative: templating. Jinja2-style templating is available in most programming languages. You can do some templating, but it quickly gets unreadable.
All these solutions are great in their own way, but also suck in their own way.
Styling/css. This is an area that actually saw a lot of improvements in the last years! CSS now supports variables out of the box, so no need for “sass” or so anymore.
You used to use bootstrap, jquery and a few other things and try to style your divs and spans. You don’t need to do that anymore: there is more than just span and div nowadays. Classless: you use html’s newer elements such as <article> and get something not-too-bad out of the box. You don’t need custom class statements anymore.
CSS has its own utility frameworks now, like tailwind. He dislikes tailwind (just use a style on your element…). For the rest, css seems in pretty good shape.
Interactivity/javascript. Javascript used to be essential for things like hovers and tooltips. But: that’s built into html/css now! No need for javascript for this.
You could look at web assembly. https://pyscript.net/ for running python in the browser. Nice. But you need to know both the internal browser API and the webassembly bindings and get python to run… He looks at this field about once every half year to see if it is ready for normal use.
HTMX could be nice. https://htmx.org/ . Interactivity for your html page with auto-replacing of certain parts of the page without you needing to do javascript. It is pretty popular, but he found lots of the functionality pretty hard to use. After two years, he found out he used only one part of it most of the time. So he wrote some small javascript thingy to do just one simple kind of replacement.
Interactivity: most of it sucks.
Summary: there is no one silver bullet for your project. In many cases you’re going to benefit from building something yourself. So: if there’s no silver bullet, just produce a lot of regular bullets. Small custom self-made one-offs.
2024-10-10
(One of my summaries of the one-day Pycon NL conference in Utrecht, NL).
Full title: efficient python project setup: showing cookiecutter’s potential within Kedro.
Kedro: https://kedro.org/, “a toolbox for production-ready data science”. Open source, python. It helps you apply regular software engineering principles to data science code, making it easier to go from prototype to production.
Things like Jupyter notebooks are great for experimenting, but not nice when you throw them over the wall to some programmer to clean them up and convert them to “real” code.
Kedro consists of:
Project template. This is done with cookiecutter.
Data catalog. Core declarative IO abstraction layer.
Nodes + pipelines.
Experiment tracking.
Extensibility.
Cookiecutter: https://cookiecutter.readthedocs.io/ . You use cookiecutter (the program) to create projects from “cookiecutter templates”. Such a template gives you a repository structure out of the box, filled in with some parameters that you provide like the name of the project.
Cookiecutter reads a settings file and prompts you interactively for some variables it wants you to provide. It then reads a directory structure and generates an output directory based on it. Really handy, as you normally get a README, some pyproject.toml or so, a proper directory structure, perhaps a sample test file.
The alternative is to start with an empty directory. Does the data scientist know or care about a README? Or how to set up a python project? It is much better to provide a handy starting point out-of-the-box.
There was a love/hate relationship with the Kedro cookiecutter templates. The templates were pretty complete, but the completeness meant that there was actually a lot of code in there: steep learning curve and lots of boilerplate. Documentation generation, for instance, which isn’t always needed.
They then made a second version that asked a few more questions and limited the amount of generated code based on the answers. For this customization they used the “hooks” that cookiecutter provides: pre_prompt, pre_gen_project, post_gen_project. pre_gen_project can adjust the filled-in variables before actually generating the code. post_gen_project can be used to adjust the code after generating.

With some if/else and some post_gen_project cleanup of the generated code, they were able to limit the amount of unnecessary generated code.
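A hedged sketch of such a cleanup hook (the include_docs variable is invented):

# hooks/post_gen_project.py
# Runs inside the freshly generated project; the template variables have
# already been rendered, also into this very file.
import shutil

if "{{ cookiecutter.include_docs }}" != "yes":
    shutil.rmtree("docs", ignore_errors=True)  # drop unwanted docs scaffolding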
So… use cookiecutter! A great way to help colleagues and users get started in an easy and right way.
2024-10-10
(One of my summaries of the one-day Pycon NL conference in Utrecht, NL).
Localisation and translation of programming languages: how to make programming languages more inclusive and why that matters. Felienne is professor at the Vrije Universiteit in Amsterdam. She also works at a school teaching 12-year-olds. Those 12-year-olds wanted to learn programming and she agreed to teach it.
When she herself learned programming, it was in the 1980s, without any adults in her environment to teach her. So she typed over some Basic code from a book and taught herself. The compiler was the teacher and she learned to read error messages.
But current 12 year olds don’t learn that way:
> print("Hello world") # <= Note the extra space in front of print
^^^ IndentationError
“Teacher, what is an indentationerror?” Or rather, in Dutch: “juf, wat betekent een indentation error?” (“miss, what does an indentation error mean?”). So she wrote a much simpler language to teach the kids: simple code that “Lark” (a python parsing library) parses into proper syntax. This is then converted to an “AST”, an abstract syntax tree, which is then converted to python.
A request that came up quickly was whether the keywords could also be in Dutch. So not “ask” but “vraag”. She found it weird as Dutch kids are supposed to be good at English. But yeah, it was a natural question and she got it working. Even with mixing languages at the same time.
Then the next request came from someone from Palestine: couldn’t she make a version for Arabic? A right-to-left language… And what about variable names? She looked up the definition: a combination of underscores, lowercase and uppercase characters. Oh. It didn’t include the accented characters of many European languages. And most especially not Arabic characters, as those have no upper/lowercase…
Right-to-left: not everything is hard. In right-to-left, the first character your computer gives you is the rightmost character. Easy. And even if you use (brackets), what looks like a closing bracket on the right is actually an opening bracket: fonts solve this!
It does depend on the program/editor that renders your code or output, though. Especially when mixing languages, you can get weird results. She showed a python traceback where an RTL string was shown as LTR.
Our 0123456789 numbers are Arabic numbers, right? As opposed to the Roman I, II, III, IV, V? Well, actually Arabic uses different numerals: ١, ٢, ٣. Why don’t we learn these instead of Roman numerals? (Note: I hope this renders correctly. My editor (emacs) is doing funny (but probably correct!) thingies, moving the cursor as it recognises the right-to-left pasted characters.)
It is epistemic injustice. Epistemic is something like “the study of knowledge”. In this case she means that loads of people are done an injustice as their numbers are not allowed. She showed an Arabic “1+1” in many programming languages, with the syntax errors they result in. Loads of people in the world are basically discriminated against because a small group of western people designed programming languages in a specific way.
Well, does it work? Does it work to teach programming using such a localised, translated language? Yes, it does. They tested it in Botswana on bilingual kids (most kids there speak English in addition to the local language). The kids using the localised, translated language learned more and used it more. It was easier to understand concepts.
It should also be kept in mind that English, in many countries, is the language of either a former coloniser or oppressor or the country that bombed them. What message are you sending if you effectively say that you have to use the English language when you go into IT?