Pycon NL: Efficient python project setup with cookiecutter - Merel Theisen

Tags: pycon, python

(One of my summaries of the one-day Pycon NL conference in Utrecht, NL).

Full title: efficient python project setup: showing cookiecutter’s potential within Kedro.

Kedro: https://kedro.org/, “a toolbox for production-ready data science”. Open source, python. It helps you apply regular software engineering principles to data science code, making it easier to go from prototype to production.

Things like Jupyter notebooks are great for experimenting, but not so nice when you throw them over the wall to some programmer who has to clean them up and convert them to “real” code.

Kedro consists of:

  • Project template. This is done with cookiecutter.

  • Data catalog. Core declarative IO abstraction layer.

  • Nodes + pipelines.

  • Experiment tracking.

  • Extensibility.

Cookiecutter: https://cookiecutter.readthedocs.io/ . You use cookiecutter (the program) to create projects from “cookiecutter templates”. Such a template gives you a repository structure out of the box, filled in with some parameters that you provide like the name of the project.

Cookiecutter reads a settings file (cookiecutter.json) and prompts you interactively for the variables it wants you to provide. It then reads the template’s directory structure and generates an output directory based on it. Really handy, as you normally get a README, a pyproject.toml or similar, a proper directory structure, perhaps a sample test file.
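To make the “settings file plus directory structure” idea concrete, here is a minimal sketch that writes a tiny template to a temporary directory. The template name and the two variables (project_name, project_slug) are my own placeholders, not something from the talk or from Kedro’s templates.

```python
import json
import tempfile
from pathlib import Path


def write_minimal_template(root: Path) -> Path:
    """Create a tiny cookiecutter-style template under `root` and return its path."""
    template = root / "my-template"  # hypothetical template name
    template.mkdir(parents=True)

    # cookiecutter.json: every key becomes a prompt, every value its default.
    settings = {
        "project_name": "My Project",
        "project_slug": "my_project",
    }
    (template / "cookiecutter.json").write_text(json.dumps(settings, indent=2))

    # Directory names and file contents can both contain template variables.
    project_dir = template / "{{cookiecutter.project_slug}}"
    project_dir.mkdir()
    (project_dir / "README.md").write_text("# {{cookiecutter.project_name}}\n")
    return template


template_dir = write_minimal_template(Path(tempfile.mkdtemp()))
for path in sorted(template_dir.rglob("*")):
    print(path.relative_to(template_dir).as_posix())
```

Pointing the cookiecutter command at such a directory should prompt for the two values and generate a project directory named after the slug, with the README filled in.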

The alternative is to start with an empty directory. Does the data scientist know or care about a README? Or how to set up a python project? It is much better to provide a handy starting point out-of-the-box.

There was a love/hate relationship with the Kedro cookiecutter templates. The templates were pretty complete, but the completeness meant that there was actually a lot of code in there: steep learning curve and lots of boilerplate. Documentation generation, for instance, which isn’t always needed.

They then made a second version that asks a few more questions and limits the amount of generated code, based on the answers. For this customization they used the “hooks” that cookiecutter provides: pre_prompt, pre_gen_project and post_gen_project. pre_gen_project runs after the questions are answered, so it can check (and, if needed, reject) the filled-in variables before the code is actually generated. post_gen_project can be used to adjust the code after generating.
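As an illustration of a pre_gen_project hook: a common use is validating an answer before anything is generated. This is a sketch under my own assumptions (a project_slug variable that must be a valid Python identifier), not Kedro’s actual hook.

```python
import re

# Hypothetical rule: the slug must be usable as a Python package name.
SLUG_PATTERN = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")


def is_valid_slug(slug: str) -> bool:
    """Return True if `slug` matches the pattern from start to end."""
    return SLUG_PATTERN.fullmatch(slug) is not None


# In a real hooks/pre_gen_project.py, cookiecutter renders the hook file
# before running it, so the value would arrive like this:
#     slug = "{{ cookiecutter.project_slug }}"
# and the hook would call sys.exit(1) when is_valid_slug(slug) is False.
print(is_valid_slug("my_project"), is_valid_slug("my-project"))
```

A hook that exits with a non-zero status makes cookiecutter stop the generation, which is what turns such a check into a guard.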

With some if/else in the templates and some post_gen_project cleanup of the generated code, they were able to limit the amount of unnecessary generated code.
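The cleanup step could look roughly like the sketch below: a function that removes an optional directory when the user said they don’t need it. The include_docs question and the docs directory are hypothetical names of mine, chosen to match the documentation example mentioned earlier.

```python
import shutil
import tempfile
from pathlib import Path


def remove_unwanted(project_dir: Path, include_docs: bool) -> None:
    """Delete the optional docs directory if the user did not ask for it."""
    if not include_docs:
        shutil.rmtree(project_dir / "docs", ignore_errors=True)


# Small demo on a throwaway directory that stands in for a generated project.
demo_project = Path(tempfile.mkdtemp())
(demo_project / "docs").mkdir()
(demo_project / "src").mkdir()
remove_unwanted(demo_project, include_docs=False)
print(sorted(p.name for p in demo_project.iterdir()))

# In a real hooks/post_gen_project.py the answer would come from the rendered
# template variable, for instance:
#     include_docs = "{{ cookiecutter.include_docs }}" == "yes"
#     remove_unwanted(Path.cwd(), include_docs)
```

Cookiecutter runs the post_gen_project hook with the generated project’s root as working directory, which is why Path.cwd() would be enough there.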

So… use cookiecutter! A great way to help colleagues and users get started easily and in the right way.

 

About me

My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.
