2023-05-26
(One of my summaries of the May 2023 Dutch PyGrunn conference).
Sebastián Ramírez (“tiangolo”) has created fastAPI (and other open source projects). He talks about modern python. By modern python he means 3.7+.
One of the new items: f-strings, formatted strings: f"Hi {name}".
Type annotations, type hints. We’ve been working without type annotations all this time, so why should we add them? Well, for instance for great editor support. Suddenly the editor’s autocompletion is way better. If the editor knows some variable is a string, it can offer autocompletion for the variable by showing string’s methods.
You also get better errors in your editor when multiplying by something that can also be a None, for instance.
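A tiny illustration of what the editor gets to work with (my own sketch, not from the talk):

    from typing import Optional

    def greet(name: str, times: Optional[int] = None) -> str:
        # The editor knows "name" is a str, so it can autocomplete .title(), .upper(), etc.
        greeting = f"Hi {name.title()}"
        # It can also warn that "times" might still be None before you multiply with it.
        return greeting * (times or 1)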
Type annotations are used a lot in fastAPI. Put @app.get("/recipes/{name}") as a decorator on a view function and a type “string” on the name argument and you get autocompletion. And type checking in the fastAPI api with automatic input validation. And you get data conversion: a ?quantity=2 in a URL is basically a string, but fastAPI knows it should be converted into an integer.
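A minimal sketch of what that looks like (the recipe endpoint is just my example, not code from the talk):

    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/recipes/{name}")
    def read_recipe(name: str, quantity: int = 1):
        # "name" comes from the URL path, "quantity" from ?quantity=2.
        # fastAPI validates both and converts quantity from string to int.
        return {"name": name, "quantity": quantity}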
Typer is a library for using the same functionality for command line interfaces. Including tab completion. A tip: use rich to make your interface nicer.
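A tiny sketch of the same type-hint idea with typer (my example):

    import typer

    app = typer.Typer()

    @app.command()
    def greet(name: str, count: int = 1):
        # Typer validates and converts the arguments based on the type hints,
        # and generates --help output and tab completion for free.
        for _ in range(count):
            typer.echo(f"Hi {name}")

    if __name__ == "__main__":
        app()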
Another library: pydantic. It makes handling data models handier in fastAPI. Again, it uses those modern python type hints. It includes documentation generation.
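Roughly what a pydantic model looks like (a sketch, not code from the talk):

    from pydantic import BaseModel

    class Recipe(BaseModel):
        name: str
        quantity: int = 1

    # Input gets validated and converted: the string "2" becomes the integer 2.
    recipe = Recipe(name="stroopwafel", quantity="2")
    print(recipe.quantity)  # 2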
And… another library. SQLModel. Pydantic-style classes, but for SQL databases. (SQLAlchemy does all the work behind the scenes).
Async-await concurrency. Python itself is fairly fast compared to network requests. So with regular web requests, you spend a lot of time waiting for I/O to complete. With asynchronous requests, the waiting for multiple requests can overlap while the CPU keeps doing useful work. You get your results much faster.
Async/await is not something that you should sprinkle everywhere in your code. You must be sure you put it in the correct spots where it actually is useful (and possible).
You can look at httpx, which allows you to use async where useful and regular handling where it is not needed without changing your code.
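A small sketch of how the async side looks with httpx (my example, with made-up URLs):

    import asyncio
    import httpx

    async def fetch_all(urls):
        # While one response is "in flight", the others can be awaited too:
        # the waiting overlaps instead of happening one after the other.
        # (httpx.Client() offers the same API synchronously where async isn't needed.)
        async with httpx.AsyncClient() as client:
            responses = await asyncio.gather(*(client.get(url) for url in urls))
        return [response.status_code for response in responses]

    print(asyncio.run(fetch_all(["https://example.com", "https://example.org"])))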
Note: throughout his presentation, Sebastián showed the names and pictures of contributors to all the mentioned projects. Open source! Thanks for the work. You yourself can help, too. Translating pages, answering questions, helping with bugs/features. Often there are only a few maintainers, so all help is welcome.
2023-05-26
(One of my summaries of the May 2023 Dutch PyGrunn conference).
Arjan Egges runs the ArjanCodes youtube channel.
“Software design” often sounds like “huge diagrams with lots of arrows”. Sometimes software design is complicated, like determining what the customer actually wants. Or a big change in your major concept. Or fitting together completely different technologies, especially when they are opinionated.
What you also hear… “the code works, I don’t have time for refactoring”. “I have deadlines, and it works, so I don’t have to write tests”. “I just ask chatgpt to clean my code”.
His main point: software design is about consistently taking slightly different decisions while you’re coding. These decisions compound and should make you code faster and better.
He has seven tips for us:
Name wisely. This is about categorising things. Being clear about what it is. Even better: a good name makes sure you yourself know what something is :-) If you cannot find a good name for something, perhaps you should approach the problem in a different way? Perhaps you should turn your class into a couple of functions.
If a function has “and” in the name: perhaps it should be split into two separate functions.
If you use the word “library”, don’t use “lib” in other parts of the code. Be consistent.
Use the naming scheme of the language that you use.
Use type annotations. It is extra work. But as you need to be explicit, it also ensures you have a better understanding of your own code.
Injecting dependencies. It is easier to later change your code. When you pass in a function to do something, you can later change the argument to a slightly different function. Your code is extensible. (A small sketch follows below the tips.)
Composing your code in this way makes it easier to limit the amount of places where you tie everything together. It is also easier to test.
The core is that you should make the conscious design decision to approach most problems in this way. Your code will be cleaner and nicer.
Default to functions. Limit the amount of classes that you use to just what’s needed.
Functions are easier to test than classes.
Don’t be greedy. If you only need a city name and a postal code, you don’t have to get an entire “Location” object with many other attributes. If you don’t need it, don’t request it.
Use abstraction. The abovementioned “Location” class could be abstracted into an “Address” class that has a couple of attributes fewer.
Python has “protocol classes” that are super-useful for abstractions. But way too few people know about those. (I also didn’t know about them). Also illustrated in the sketch below the tips.
Be finicky. It is OK to be precise about what your function accepts. Your tests will be simpler. You’ll have to do a little bit more work where you call the function, but that’s fine.
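Two tiny sketches of the dependency-injection and protocol-class tips (my own illustrations, not code from the talk):

    from typing import Callable, Protocol

    # Injecting dependencies: pass in the function that does the work,
    # so you can swap it later (or in tests) without touching this code.
    def send_report(report: str, send: Callable[[str], None]) -> None:
        send(report)

    send_report("weekly numbers", send=print)  # later maybe: send=some_email_sender

    # Abstraction via a protocol class: any object with these attributes
    # fits, without inheriting from anything.
    class Address(Protocol):
        city: str
        postal_code: str

    def format_address(address: Address) -> str:
        return f"{address.postal_code} {address.city}"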
Apply these tips to your code and see where they bring you.
2023-05-26
(One of my summaries of the May 2023 Dutch PyGrunn conference).
Single page applications (SPA) suck! Stop building them! (He worked on them for the last 12 years :-) )
The promises of SPAs: UI is a function of the state. A unified API for UI and native apps. Nice composability. Better long-term maintenance. And faster development because you can split into backend and frontend teams.
6 years later… 2.5MB minimized javascript. Page speed performance sucked.
The negative side of SPAs: Performance is bad. Complexity is through the roof. Display logic in your app logic. Business logic creeps into your clients. The codebase is split. And the tech split makes people feel helpless.
The backenders are excluded from the frontend so they just start to care about the API. The frontenders complain about the unusable API that doesn’t care about the frontend…
Back to jquery? Never. It takes only two years instead of six years for your code to go bad. Composability doesn’t even exist as a word in jquery. Logic spread out everywhere.
We need a modern interactive site. What do we actually want?
Composability.
Understandability.
Server-side html rendering and a minimum of javascript.
Empowered backend developers.
He demoed “twinspark”, a small javascript library that can replace parts of an html page from URLs. On click or on hover. He wrote the javascript in some 200 lines of code. So “hover over this element and then replace it by what this url returns”. Super-lightweight, super-simple. Composable: it is all just your own template.
He extended it a bit and it is now 2000 lines of code and 8kb minified.
They use it instead of their former SPA now: they’re a big Ukrainian online retailer! The html page is 2.5x smaller. The javascript 80x smaller. And suddenly the browser is fast! The old site made browsers weep.
So it is good for the frontend. Is it also good for the backend? Well, there’s no weird state management. Simple request/response model. And you can add interactivity without javascript.
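On the backend, such a partial update is just a normal view that renders a snippet of html instead of a whole page. A rough django-flavoured sketch (url and markup made up by me, not from the talk):

    from django.http import HttpResponse

    def price_fragment(request, product_id):
        # Returns only a small piece of html; the twinspark/htmx attribute
        # on the page decides which element gets replaced by it.
        price = {"1": "€ 9.99"}.get(str(product_id), "unknown")
        return HttpResponse(f"<span class='price'>{price}</span>")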
He showed a couple of nice examples.
The downsides? The frontend developers hate being forced out of their job. And keeping the same API for web and apps is harder. Figuring out proper patterns takes some time: we’re very used to single page apps, it takes getting used to again. And if you have really complex multi-step behaviour: that’s easier to do in React.
But… in their big webshop, there’s only one complex multi-step process that’s hard. All the other parts, like the order form, are now simpler. It is approachable. It is fun. It is fast.
“htmx” is an alternative. It has a much bigger community (but was started after his project… and misses some features). “Unpoly” is an alternative, too. But it is much bigger and more complex.
You can also look at the “live view” mechanism: phoenix, laravel, rails. Websockets. In demos, it is mind-blowing. But the behaviour under load is questionable. And state is kept in memory on the server: you’ll need lots of them.
2023-05-26
(One of my summaries of the May 2023 Dutch PyGrunn conference).
You might get the question “why do you use python, isn’t that a slow language”? Well, it is slower than many other languages, but not that slow. You should also look at programmer productivity. You wouldn’t write an operating system in python. You also wouldn’t write a web framework in C.
Python is dynamically typed instead of statically typed. This makes it easier for humans to understand and quicker to write. It makes sure you don’t have to deal with many low-level details. But it can cause errors.
Python is interpreted instead of compiled. If you compile code, it turns your program into machine code. The compiler has nothing further to do with it. With an interpreter, the interpreter is actually running your code when you run it. So the interpreter can handle all the OS-specific optimizations. “Live”. There’s byte code caching, so it’s not that it does double work.
In a compiled language, variables are stored in a fixed location in memory. If you give a variable a new value, that new value is stored in the same memory location. In python, everything is stored as a PyObject with:
a type
the value
a reference count
If you assign a new value to a variable name, a new PyObject is created. This can have a different “type”. The old PyObject isn’t referenced anymore, so its reference count drops to zero. It can then be garbage collected.
Python has to do more work, so it is less efficient than a compiled language. Creating PyObjects and names. It also takes quite some more memory due to the PyObject “wrapper”.
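You can see a bit of this from python itself (a small illustration of my own, not from the talk):

    import sys

    value = 1000
    print(id(value))               # memory address of the PyObject
    print(sys.getrefcount(value))  # reference count (temporarily +1 because of the call)

    value = "something else"       # rebinding creates a *new* PyObject...
    print(id(value))               # ...so the id (address) changes; the old object can be
                                   # garbage collected once its reference count drops to zero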
The notorious GIL is the Global Interpreter Lock. It ensures thread-safe memory access: only one thread executes byte code at the same time. This is especially needed to keep the reference counts accurate.
The GIL has an exception for I/O: that can happen concurrently. It is the CPU work that is governed by the GIL.
Important to keep in mind: python needs to work a bit harder to enable us to do less work: performance versus productivity. Some generic comments:
If you have a lot of I/O: threads. A lot of CPU: multiprocessing.
Watch out with loops where lots of PyObjects need to be created.
Use built-ins as much as possible. List comprehensions instead of a loop, for instance.
The same with numpy and pandas built-ins. Applying a lambda to all elements is waaaaaay slower than applying the function to the whole numpy array at once, where the work happens efficiently inside numpy (see the sketch below).
If a specific small piece of code is slow, you can try extending python. “pyx”, compiled python code. cy_types is compiled python with extra type hints. You can also go to c or rust code.
Numba, dask, ray, pypy are alternatives you can look at.
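A small illustration of that numpy point (my example):

    import numpy as np

    values = np.arange(1_000_000)

    # Slow-ish: a python-level operation (and PyObject) per element.
    doubled_slow = np.array([x * 2 for x in values])

    # Fast: one vectorized operation that runs inside numpy's C code.
    doubled_fast = values * 2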
2023-05-26
(One of my summaries of the May 2023 Dutch PyGrunn conference).
You might need files for testing. Real test files are perhaps not available when you need them. Synthetic data might just do fine for most use-cases. Real data might not even be allowed due to privacy concerns, for instance.
You could use Faker to generate fake names and addresses and so on for your test. You have control over what you generate: faker.zip_code(), faker.company_email(). Faker helps when you need to generate separate fields.
But sometimes you need actual files. For that you can use faker-file. faker-file works with Faker and factory_boy. It is added as a “faker provider”. It supports text, csv, docx, mp3, png, pdf, epub, etc, etc. Also .eml email files.
You can have it generate random text, but you can also pass sample text. You can also pass a template: you can use Faker’s regular methods like first_name and address in there. Handy!
If you generate a png, it will be a png with a bitmap of the text. A zipfile with some folders and docx files is also possible. And zipfiles with folders and zipfiles with pngs in them. So: basically everything.
Normally, the files are stored in some tempfile directory. You can also get the raw byte contents if you need to pass it as test data to an API method, for instance.
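Roughly how that looks in code (a sketch from memory, so treat the exact provider and argument names as assumptions and check the faker-file docs):

    from faker import Faker
    from faker_file.providers.docx_file import DocxFileProvider
    from faker_file.providers.txt_file import TxtFileProvider

    fake = Faker()
    fake.add_provider(TxtFileProvider)
    fake.add_provider(DocxFileProvider)

    txt_file = fake.txt_file()                                 # random text, stored in a temp dir
    docx_file = fake.docx_file(content="Sample text for the test")  # pass your own text
    print(txt_file)                                            # the generated filename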
In case of Django: django needs the files to be inside its MEDIA root, otherwise you can get a “suspicious file error” exception. There’s support to handle that.
Handy: you can also call faker-file on the command line.
2023-05-26
(One of my summaries of the May 2023 Dutch PyGrunn conference).
API on the fly? The talk is about autogenerating APIs using “Amsterdam Schema”. Jan works for datapunt Amsterdam, a project for being transparent as the city of Amsterdam. Your citizens pay taxes, so they have a right to access the open data that the municipality maintains.
They have api projects for many datasets. Like “BAG”, the official building database. Or a list of the power sockets for the weekly markets. Lots of geographical information.
They use postgres, docker containers, ubuntu, python + django + djangorestframework. Developed specifically for this project: django gisserver.
DSO, digitaal stelsel omgevingswet, is a Dutch API specification for this kind of data. The various sorts of data are all specified in a json format that they call Amsterdam Schema. All data sets are available as geographical layers.
They have a custom django management command (import_schemas --create-tables) to read such a json schema file and create the empty tables in the database. He demonstrated (live) how it works. The data was in the database and a nice swagger-like page was shown in the website with the option to browse the REST API. There’s even support for generating an SQL file with migration instructions in case there’s a new version of the schema.
Geographical information is available as WFS, vector tiles and geojson.
The REST API contents are all read directly from the database. For debugging, there’s a possibility of generating a regular django database model.
The code is open source: https://github.com/amsterdam/dso-api
2023-05-26
(One of my summaries of the May 2023 Dutch PyGrunn conference).
Mere mortals? There are two kinds of people in the world: those that like to optimize and those that are mere mortals :-) He’ll try to get us over to the optimization side.
Python is pretty slow. It is getting faster and faster all the time, though. And speed isn’t everything: readability and maintainability are also important. And a large community. Optimization at the language level will be talked about in the “python under the hood” talk later during this conference.
Something you often hear:
Premature optimization is bad. Worrying about efficiency in the wrong places and at the wrong times is a real problem according to Donald Knuth.
Micro-optimization is bad. But… is that so? A small part of your code might be called lots of times. Have you profiled it? Look at the ROI (return on investment). Time spent optimizing code that isn’t the actual problem is time that is wasted. Time spent on slow code that is called a lot, that’s a good thing.
What he’s suggesting is what he calls opportunistic optimization. It is a bit like the “boy scouts’ rule”: make the code a little bit better than when you found it. Passively keep an eye out for simple improvements that you can do on the codebase. If you do something, it should have a significant performance improvement in the context of the piece of code you’re working on.
For this you’ll need to know your tech stack well enough to spot common improvement possibilities. A good point to start are python’s built-in data structures: use them a lot and know how to use them. These are implemented in very efficient C code. Lists, sets, dicts, generators. List comprehensions are often both more readable and much quicker than a for loop. f-strings instead of string concatenation. Data classes.
Some comments:
Most of the time, you won’t need to do anything. If you don’t spot a possible optimization, the code is probably OK.
Optimization is good, but don’t change the way the program works, don’t change the flow.
Don’t use dirty tricks and loopholes to gain a bit of performance.
Don’t compromise the readability of the code!
He showed a couple of clear examples: for loops turned into list comprehensions. frozenset for filtering out duplicates from an unchanging set. Not compiling a regex all the time, but doing it only once. from functools import cache, cached_property. from itertools import islice.
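A few of those in code form (my own small examples, not taken from his slides):

    import re
    from functools import cache

    ALLOWED = frozenset({"read", "write", "admin"})   # build the set once, reuse it everywhere
    POSTCODE = re.compile(r"^\d{4}\s?[A-Z]{2}$")      # compile the regex once, not per call

    @cache
    def expensive_lookup(key: str) -> str:
        # The result gets memoized, so repeated calls with the same key are free.
        return key.upper()

    squares = [n * n for n in range(10)]              # list comprehension instead of a for loop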
Also look at third party libs. If you have large arrays, “numpy” will improve your performance a lot, for instance.
Code optimization is balancing scarce system resources (cpu, memory) with scarce developer time. “Opportunistic optimization” might be a good approach.
2023-05-26
(One of my summaries of the May 2023 Dutch PyGrunn conference).
Ivor works for dacom, a GIS and agri company. They use both django and fastAPI.
FastAPI is a relatively new web framework for building REST APIs. The ingredients are pydantic, data validation based on python type hints, and starlette, an asgi async toolkit.
With pydantic you can define classes/models of your data. With python type hints. Pydantic takes care of the type checking and also converts the input to the correct type. With Django, you’ll get the correct type out of the database, but it won’t automatically fix up the model you’re instantiating yourself.
He showed a simple django app with two models (“Farm” and “Field”). Right next to it he generated a fastAPI app. In the asgi.py file he hooked up django as usual and added a fastAPI app to the router. So django and fastapi run side by side. He showed the django admin, running as usual, and /api/ with the fastAPI functionality.
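Roughly what such an asgi.py can look like (a sketch under my own assumptions; project name and routing details are made up and the talk may have wired it up differently):

    import os

    from django.core.asgi import get_asgi_application
    from fastapi import FastAPI
    from starlette.applications import Starlette
    from starlette.routing import Mount

    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")

    django_app = get_asgi_application()   # the regular django site (admin included)

    api = FastAPI()

    @api.get("/farms/{farm_id}")
    async def read_farm(farm_id: int):
        return {"id": farm_id}

    # /api/ goes to fastAPI, everything else to django.
    application = Starlette(routes=[
        Mount("/api", app=api),
        Mount("/", app=django_app),
    ])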
He uses fastapi-utils for a “DTO”, data transfer object. fastapi-models has an APIModel class to map pydantic models to models in some other system, in this case django’s ORM. Seemed to work like a charm.
But… why would you? You get a consistent API, automatic documentation and validation. An automatic api style and consistent naming. No undocumented attributes.
In 2011 he used tastypie. Is that still maintained? At the time, users would look weirdly at an API: “can’t I use ftp and csv files”? Since 2013 he started using djangorestframework. Nowadays, users ask for the swagger endpoint: they’re used to APIs and want a clean and modern one. The world has changed. FastAPI fits this new world a bit better.
The exposed API should be intuitive. The api should be stable. And useful for front-end teams and third parties. Your database model is not your REST API, so really design your UI and REST API, it is not an afterthought. Make a beautiful and logical API out of it. So a separately defined API is a bit of extra work, but it is worth the effort.
Why not djangorestframework? Well, it is very well integrated with Django. He still uses it for multiple projects. But he is starting to like fastAPI. Another alternative might be Django Ninja, which is inspired by fastAPI. It might become something nice, but is less mature at the moment.
2023-05-24
Hurray, this friday (26 May) there’s the 11th installment of the Dutch PyGrunn conference. Time to wake up again at an early hour to grab the train to the north of the Netherlands :-) Oh, and to dig out a historic legendary pygrunn t-shirt (I’ve chosen the 2012 one).
I really enjoy these conferences. There is a nice mix of talks to choose from, from python inner details to how-we-use-it. Here are four that I’m looking forward to:
Combining fastAPI and Django. At our company we’re using fastAPI for a new project, but all our existing projects are in Django. I hope to get a better feel for the (im)possibilities. Personally I haven’t used fastAPI, so a conference like pygrunn helps to broaden my knowledge. Sebastián Ramírez, the creator of fastAPI, gives the closing talk so that’ll be a good source of information.
And… Sebastián’s talk might also be a good source of a “mental model”, a way of thinking about software and software design and projects. He’ll talk about how to make code that is less error-prone, simpler, more efficient, and has a great developer experience, all at the same time. And all this while including best practices by default. I have the feeling that I could “up my game” quite a bit. I’m not using type hints everywhere. I really need to understand async/await, especially how to use that in an understandable way. Anyway, that’s what such a conference is for, too: to influence us and to change the way we think.
Jan Murre’s “your API on the fly” talk about how to generate APIs for Amsterdam’s many sets of open data. APIs filled through Airflow. We’re experimenting with Airflow, so some info from practical experience will come in handy. And APIs generated from custom schemas sounds interesting on its own.
Frontend for backenders. Yeah. I’m one of those dinosaurs: the last time I touched javascript is probably 8 years ago. A bit of templating and a bit of css is all I normally use for a frontend. Most of our frontends are nice React single page apps talking to a backenders-created REST API. So personally I’m very interested in what’s possible for backenders :-)
I always make summaries of the talks I attend, I’ll update the text above with links to the talks once the conference is over.
Oh, I’ve used “we” and “our” above. That’s Nelen & Schuurmans, a water/climate/data/python company in Utrecht (NL). “Data driven water management” is the tagline, which fits quite well. There’s a vacancy right now, which I’m sneaking into this generic blog post :-) Our office is 5 minutes walk from Utrecht central station, right in the nice old center of Utrecht… There’s lots to choose from: open source Qgis plugins, React, Django, fastAPI, Ansible, hard-core Fortran (!) hydrology, kubernetes, postgis. And you’re helping keep the Netherlands (and parts of Australia, Germany, Vietnam, etc) dry where it should be dry and wet where it should be wet.
You can talk to me at the conference for more info. There’s a 2 minute video of me explaining the company, I should still be recognizable even though I’ve lost some 12kg since making that video :-)
2023-05-10
Jan Gazda is one of the organizers of the PyAmsterdam meetup. He used to be active in the Czech open source python scene.
Jupyter notebooks: interactive, many languages, markdown support. Nice plotting. Handy for exploration.
Jan once helped a data scientist with his python notebook. One of the first things he did was ask “how direct am I allowed to be in my feedback?” Which was necessary, as there was os.system("sudo mkdir /opt/output") and os.system("sudo chmod 777 /tmp/out") in the notebook… Making directories readable/writeable for everyone on an open server is not a good idea.
The code did its job, but needed some work. Some of the necessary improvements they both saw:
Deployment.
Tests/linting.
Automatic execution.
Configuration.
Logging.
Monitoring/alerting.
Many of the improvements were done by creating a command line application to manage it all (not open source yet, btw). It mostly uses existing building blocks:
“Poetry” for python virtualenv management and nice, clear dependency resolution. pyproject.toml support. Showing a tree of the dependencies with poetry show --tree is especially handy. poetry show --latest shows whether there are updates to your packages.
“click” (https://click.palletsprojects.com) for building your commandline interface. It is the handiest library for your commandline. Automatic help page generation and so.
https://typed-settings.readthedocs.io for reading config files from file or environment variables. A handy characteristic is that the settings are all typed, so you know for sure that something ends up as the integer 2 instead of as the string "2". It has integration for the abovementioned “click”. (A small sketch follows below this list.)
https://www.structlog.org . Python’s logging is already pretty good, but structlog is faster. And the console logging is much nicer with better tracebacks and color support. Spotting what happens is more pleasurable. And… you can configure it to output json format logs.
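Two small sketches of those building blocks (my own examples; double-check the exact typed-settings API against its docs):

    import structlog
    import typed_settings as ts

    @ts.settings
    class Settings:
        host: str = "localhost"
        port: int = 8000          # "8000" from an env var ends up as an int, not a string

    settings = ts.load(Settings, appname="mytool")

    log = structlog.get_logger()
    log.info("starting", host=settings.host, port=settings.port)  # key/value (or json) output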
One thing he discovered during the process was https://papermill.readthedocs.io/ , a nice way to parametrize notebooks. You execute notebooks and you can set parameters that have defaults in some notebook cell tagged “parameters”. And you can pass along input and output notebooks.
The output file is the original notebook, but with the output of all the cells and also new cells with replaced parameters (if you used those). Input and output notebooks can be stored almost anywhere. You can use s3:// urls for instance.
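In code, that is roughly (my example; the notebook and parameter names are made up):

    import papermill as pm

    pm.execute_notebook(
        "analysis.ipynb",                     # has a cell tagged "parameters" with defaults
        "s3://my-bucket/analysis-nl.ipynb",   # output notebook, s3:// urls are fine
        parameters={"region": "NL", "year": 2023},
    )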
Actually running the notebooks, in their case, happened in the google cloud. With json logging going to the “cloud logger” and then on to sentry or datadog.
He’s part of https://datavaluepeople.com, machine learning experts. Lots of data science, data engineering and devops.
Data science is nice, but if you have lots of projects, you have a lot of repetitive work on your hands installing and maintaining it all. Efficiency of delivery, resilience of the systems. Duplication of code, multiple repositories, hard to share code. Can you automate or group that in some way?
One part of the solution is to use a monorepo that contains all code for all data science projects in one git repository. Your project-independent utility code and automation code can be shared. They use a mono repo with “projects”, “lib” and “tools” subdirectories.
Projects contain all the projects with their dockerfile and pyproject.toml.
Lib has the project-independent python code.
Tools are the install and management tools and scripts.
Code quality makes a big difference. Maintainable code pays a big dividend.
Another part of the solution is containerisation. Essential to solve the “but it works on my machine” or “it used to work on my machine” problem. Dockerfile for the win.
Another thing to figure out: your compute solution. Kubernetes? VMs? Serverless (fargate, lambda)?
Next one: storage. What they’re using is google cloud and s3. You store all the output and also the intermediary results in such a bucket and pass the URLs along between your jobs. tentaclio can help with accessing the various storage options. Timestamping the buckets can help.
For setting up the cloud infrastructure, “terraform” is a good idea. Infrastructure-as-code. (They have this in a separate repo, not in the monorepo, btw.)
They use CI/CD to keep the code clean and the dockerfiles working and automatically building.
Also needed: orchestration. Pipelines, input/output, job scheduling, logging. Something like airflow or prefect.
Some closing comments about notebooks in general:
Handy to quickly get visualisations.
Combined with papermill and parameters: great debugging.
Better for building models/analysis than data transformation.
Can get in the way of refactoring/automating.