PyAmsterdam May 2023 meetup

Jupyter notebooks in production - Jan Gazda

Jan Gazda is one of the organizers of the PyAmsterdam meetup. He used to be active in the Czech open source python scene.

Jupyter notebooks: interactive, many languages, markdown support. Nice plotting. Handy for exploration.

Jan once helped a data scientist with his python notebook. One of the first things he did was ask how direct he was allowed to be in his feedback. That was necessary, as there was os.system("sudo mkdir /opt/output") and os.system("sudo chmod 777 /tmp/out") in the notebook… Making directories readable/writeable for everyone on an open server is not a good idea.
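
For illustration, a safer pattern (a sketch, not the actual fix from that notebook) could be:

    from pathlib import Path

    # Sketch of a safer alternative: create the directory somewhere the
    # process owns, with restrictive permissions, instead of shelling out
    # to "sudo mkdir" and "chmod 777". The path is made up for the example.
    output_dir = Path.home() / "output"
    # mode is a request; the process umask may mask off further bits.
    output_dir.mkdir(mode=0o750, parents=True, exist_ok=True)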

The code did its job, but needed some work. Some of the necessary improvements they both saw:

  • Deployment.

  • Tests/linting.

  • Automatic execution.

  • Configuration.

  • Logging.

  • Monitoring/alerting.

Many of the improvements were done by creating a command line application to manage it all (not open source yet, btw). It mostly uses existing building blocks:

  • “Poetry” for python virtualenv management and nice, clear dependency resolution. pyproject.toml support. Showing a tree of the dependencies with poetry show --tree is especially handy. poetry show --latest shows whether there are updates to your packages.

  • “click” (https://click.palletsprojects.com) for building your commandline interface. It is the handiest library for your commandline, with automatic help page generation and so on. (A minimal sketch follows after this list.)

  • https://typed-settings.readthedocs.io for reading configuration from files or environment variables. A handy characteristic is that the settings are all typed, so you know for sure that something ends up as the integer 2 instead of the string "2". It has an integration for the abovementioned “click”. (Sketch after the list.)

  • https://www.structlog.org . Python’s logging is already pretty good, but structlog is faster. And the console logging is much nicer, with better tracebacks and color support, so spotting what happens is more pleasant. And… you can configure it to output json format logs. (Sketch after the list.)

  • One thing he discovered during the process was https://papermill.readthedocs.io/ , a nice way to parametrize notebooks. You execute a notebook and can pass parameters that override the defaults in the notebook cell tagged “parameters”. And you pass along input and output notebooks. (Sketch after the list.)

    The output file is the original notebook, but with the output of all the cells and also new cells with replaced parameters (if you used those). Input and output notebooks can be stored almost anywhere. You can use s3:// urls for instance.
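
A minimal “click” sketch (the run command and its options are made up for illustration):

    import click


    @click.command()
    @click.option("--count", default=1, show_default=True, help="Number of runs.")
    @click.argument("notebook")
    def run(count, notebook):
        """Run NOTEBOOK the given number of times."""
        for _ in range(count):
            click.echo(f"running {notebook}")


    if __name__ == "__main__":
        run()  # click generates --help output automatically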
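
A small typed-settings sketch, assuming a made-up app called “runner”; values can come from config files or from environment variables like RUNNER_PORT, and arrive with the right type:

    import typed_settings as ts


    @ts.settings
    class Settings:
        host: str = "localhost"
        port: int = 8080  # "8080" from a config file or env var becomes an int


    # Loads settings for the (hypothetical) "runner" app; environment
    # variables like RUNNER_HOST and RUNNER_PORT override the defaults.
    settings = ts.load(Settings, appname="runner")
    print(settings.port + 1)  # really an integer, not a string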
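
A structlog sketch; the event names and key/value pairs are invented:

    import structlog

    log = structlog.get_logger()
    # Key/value pairs show up nicely in the colored console renderer...
    log.info("notebook_finished", notebook="daily.ipynb", rows=1234)

    # ...or as fields in json logs, after configuring the JSONRenderer:
    structlog.configure(
        processors=[
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.JSONRenderer(),
        ]
    )
    log.info("notebook_finished", notebook="daily.ipynb", rows=1234)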
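
And a papermill sketch, with hypothetical notebook paths and parameters:

    import papermill as pm

    # Executes the input notebook and writes a copy containing all cell
    # output plus the injected parameters; the output location could also
    # be an s3:// url, for instance.
    pm.execute_notebook(
        "templates/report.ipynb",
        "output/report-2023-05-01.ipynb",
        parameters={"start_date": "2023-05-01", "region": "nl"},
    )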

Actually running the notebooks, in their case, happened in the google cloud, with json logging going to the “cloud logger” and then on to sentry or datadog.

Building a minimal data science platform - Benjamin Cerigo

He’s part of https://datavaluepeople.com, machine learning experts. Lots of data science, data engineering and devops.

Data science is nice, but if you have lots of projects, you have a lot of repetitive work on your hands installing and maintaining it all. You want efficiency of delivery and resilience of the systems, but you get duplication of code, multiple repositories and code that is hard to share. Can you automate or group that in some way?

One part of the solution is to use a monorepo that contains the code for all data science projects in one git repository. Your project-independent utility code and automation code can be shared. They use a monorepo with “projects”, “lib” and “tools” subdirectories (see the layout sketch after this list):

  • Projects contains all the projects, each with their own dockerfile and pyproject.toml.

  • Lib has the project-independent python code.

  • Tools are the install and management tools and scripts.
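
A hypothetical layout (the project name is invented):

    monorepo/
    ├── projects/
    │   └── some_project/      # one directory per data science project
    │       ├── Dockerfile
    │       └── pyproject.toml
    ├── lib/                   # shared, project-independent python code
    └── tools/                 # install and management tools and scripts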

Code quality makes a big difference. Maintainable code pays a big dividend.

Another part of the solution is containerisation. Essential to solve the “but it works on my machine” or “it used to work on my machine” problem. Dockerfile for the win.

Another thing to figure out: your compute solution. Kubernetes? VMs? Serverless (fargate, lambda)?

Next one: storage. What they’re using is google cloud and s3. You store all the output and also the intermediate results in such a bucket and pass the URLs along between your jobs. tentaclio can help with accessing the various storage options. Timestamping the buckets can help.
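
A tentaclio sketch (bucket name and paths are made up); the URL scheme determines which backend handles the reading and writing:

    import tentaclio

    # Read an intermediate result written by a previous job...
    with tentaclio.open("s3://my-bucket/2023-05-01/intermediate.csv") as reader:
        data = reader.read()

    # ...and write this job's output for the next job to pick up.
    with tentaclio.open("s3://my-bucket/2023-05-01/result.csv", mode="w") as writer:
        writer.write(data)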

For setting up the cloud infrastructure, “terraform” is a good idea. Infrastructure-as-code. (They have this in a separate repo, not in the monorepo, btw.)

They use CI/CD to keep the code clean and the dockerfiles working and automatically building.

Also needed: orchestration. Pipelines, input/output, job scheduling, logging. Something like airflow or prefect.
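
To give a taste, a minimal prefect (2.x style) flow with made-up tasks; airflow’s DAGs of tasks look comparable:

    from prefect import flow, task


    @task
    def fetch_data() -> list[int]:
        return [1, 2, 3]


    @task
    def transform(rows: list[int]) -> list[int]:
        return [row * 2 for row in rows]


    @flow
    def daily_pipeline():
        # The flow wires the tasks together; prefect handles scheduling,
        # retries and logging around them.
        rows = fetch_data()
        transform(rows)


    if __name__ == "__main__":
        daily_pipeline()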

Some closing comments about notebooks in general:

  • Handy to quickly get visualisations.

  • Combined with papermill and parameters: great debugging.

  • Better for building models/analysis than data transformation.

  • Can get in the way of refactoring/automating.
