Pycon NL: Efficient python project setup with cookiecutter - Merel Theisen
Tags: pycon, python
(One of my summaries of the one-day Pycon NL conference in Utrecht, NL).
Full title: efficient python project setup: showing cookiecutter’s potential within Kedro.
Kedro: https://kedro.org/, “a toolbox for production-ready data science”. Open source,
python. It helps you apply regular software engineering principles to data science code,
making it easier to go from prototype to production.
Things like Jupyter notebooks are great for experimenting, but less so when you throw
them over the wall to some programmer to clean up and convert to “real” code.
Kedro consists of several parts; the one this talk focused on is its cookiecutter-based
project template.
Cookiecutter: https://cookiecutter.readthedocs.io/ . You use cookiecutter (the program)
to create projects from “cookiecutter templates”. Such a template gives you a repository
structure out of the box, filled in with some parameters that you provide like the name
of the project.
Cookiecutter reads a settings file and prompts you interactively with some variables it
wants you to provide. It then reads a directory structure and generates an output
directory based on it. Really handy, as you normally get a README, some pyproject.toml
or so, a proper directory structure, perhaps a sample test file.
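The prompt-then-generate flow described above can be imitated in a few lines of
standard-library Python. This is only a toy sketch, not cookiecutter's actual
implementation (which renders templates with Jinja2). The function names fill_in and
render_template, the plain string replacement of "{{ cookiecutter.<name> }}"
placeholders and the sample variable project_name are assumptions for illustration.

```python
from pathlib import Path


def fill_in(text: str, context: dict[str, str]) -> str:
    """Replace every '{{ cookiecutter.<key> }}' placeholder with its value."""
    for key, value in context.items():
        text = text.replace("{{ cookiecutter." + key + " }}", value)
    return text


def render_template(template_dir: Path, output_dir: Path, context: dict[str, str]) -> None:
    """Copy template_dir into output_dir, filling in path names and file contents."""
    for source in sorted(template_dir.rglob("*")):
        # Placeholders may appear in directory and file names as well.
        relative = fill_in(str(source.relative_to(template_dir)), context)
        target = output_dir / relative
        if source.is_dir():
            target.mkdir(parents=True, exist_ok=True)
        else:
            target.parent.mkdir(parents=True, exist_ok=True)
            # Toy assumption: every template file is UTF-8 text.
            content = source.read_text(encoding="utf-8")
            target.write_text(fill_in(content, context), encoding="utf-8")
```

With a template directory that contains "{{ cookiecutter.project_name }}/README.md",
calling render_template(template, out, {"project_name": "demo"}) would produce
"demo/README.md" inside the output directory.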
The alternative is to start with an empty directory. Does the data scientist know or
care about a README? Or how to set up a python project? It is much better to provide a
handy starting point out-of-the-box.
There was a love/hate relationship with the Kedro cookiecutter templates. The templates
were pretty complete, but the completeness meant that there was actually a lot of code
in there: steep learning curve and lots of boilerplate. Documentation generation, for
instance, which isn’t always needed.
They then made a second version that asked a few more questions and limited the amount
of generated code, based on the answers. For this customization they used the “hooks”
that cookiecutter provides: pre_prompt, pre_gen_project, post_gen_project.
pre_gen_project can adjust the filled-in variables before actually generating the code.
post_gen_project can be used to adjust the code after generating.
With some if/else and some post_gen_project cleanup of the generated code, they were
able to limit the amount of unnecessary generated code.
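A minimal sketch of what such a post_gen_project cleanup could look like. This is not
Kedro's actual hook: the function remove_unwanted, the "docs" directory and the
include_docs flag are hypothetical names. In a real cookiecutter hook file
(hooks/post_gen_project.py) the flag would typically come from a rendered placeholder
such as "{{ cookiecutter.include_docs }}", and the hook runs with the generated project
as its working directory.

```python
import shutil
from pathlib import Path


def remove_unwanted(project_dir: Path, include_docs: bool) -> list[str]:
    """Delete optional parts of a generated project that the user did not ask for.

    Returns the relative paths that were removed.
    """
    # Hypothetical mapping: optional path -> whether the user wants to keep it.
    optional_paths = {"docs": include_docs}
    removed = []
    for relative, keep in optional_paths.items():
        target = project_dir / relative
        if keep or not target.exists():
            continue
        if target.is_dir():
            shutil.rmtree(target)
        else:
            target.unlink()
        removed.append(relative)
    return removed
```

Keeping the "what is optional" mapping in one place means that a new question in the
template only needs one extra entry here, instead of another if/else branch.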
So… use cookiecutter! A great way to help colleagues and users get started easily and
in the right way.