(One of my summaries of a talk at the 2017 PyCon.de conference).
ETL? Extract, Transform, Load. It goes hand in hand with traditional data warehousing. But that's the traditional sense. You can also see it as generic data transformation: the principles of ETL apply there, too.
ETL often implies batch running. You could do it streaming, but most people still use batch processing.
There are lots of commercial ETL tools. They have problems, however.
They are mostly designed to deal with well-structured data. They were made for moving data from one DB system to another. They don't match well with the variety of sources you tend to have now.
Mostly not open source.
You're limited to the built-in functionality.
She suggests using Python instead:
Easy.
More flexible.
Reusing logic is easier. Abstraction is possible.
You can test the logic.
Versioning and collaboration.
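To make the "reusable" and "testable" points concrete, here is a minimal sketch of an ETL job written as plain Python functions. It is not from the talk: the function names, the sample CSV and the `warehouse` list (standing in for a real database) are all made up for illustration.

```python
import csv
import io


def extract(csv_text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: tidy up the names and convert the amounts to floats."""
    return [
        {"name": row["name"].strip().title(), "amount": float(row["amount"])}
        for row in rows
    ]


def load(rows, target):
    """Load: append the rows to the target (a plain list stands in for a database)."""
    target.extend(rows)
    return len(rows)


def run_pipeline(csv_text, target):
    """Chain the three steps; each one stays reusable and testable on its own."""
    return load(transform(extract(csv_text)), target)


# Hypothetical sample input, only here to show the functions working together.
SAMPLE = "name,amount\n alice ,10.5\nBOB,2\n"
warehouse = []
loaded = run_pipeline(SAMPLE, warehouse)
```

Because every step is an ordinary function, you can unit test `transform()` without a database, and the whole thing lives in version control like any other code.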
Airflow is a Python platform to programmatically author, schedule and monitor workflows. It is a great tool that combines the user-friendliness of commercial ETL tools with the flexibility of Python.
It is Python!
It is open source. Note: it recently became an "Apache incubator" project. There's a good community.
Really nice: dynamic pipeline and task creation. You can create them programmatically on the fly.
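A rough idea of what "creating tasks programmatically" means, as a library-free sketch. This deliberately does not use Airflow's own API: the task names and the `SOURCES` list are hypothetical, and the standard library's `graphlib` (Python 3.9+) stands in for a scheduler that works out the order. In Airflow you would, as far as I understand it, create the tasks in a similar loop inside the pipeline definition.

```python
from graphlib import TopologicalSorter

# Hypothetical list of sources; in practice this could come from a config file.
SOURCES = ["customers", "orders", "payments"]


def build_pipeline(sources):
    """Return {task_name: set_of_upstream_task_names}, generated in a loop."""
    graph = {"report": set()}
    for source in sources:
        extract = f"extract_{source}"
        transform = f"transform_{source}"
        graph[extract] = set()           # extract tasks have no upstream tasks
        graph[transform] = {extract}     # transform runs after its extract
        graph["report"].add(transform)   # the report waits for every transform
    return graph


pipeline = build_pipeline(SOURCES)

# One valid execution order that respects all the dependencies.
order = list(TopologicalSorter(pipeline).static_order())
```

Adding a fourth source is a one-word change to `SOURCES`; the extra tasks and their dependencies follow automatically.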
Understanding Airflow means understanding the scheduler. (This part of the presentation went too fast for me, so "read the documentation". She mentioned an article, I think it is this one)
Passing data from one task to another is an Airflow anti-pattern. Tasks should be executable on different machines, so data should be accessible from every machine. You don't want to have to start a task on the machine where the previous stage ran. There's some limited communication possible via XCom.
The easiest way to start experimenting? Use Docker: https://github.com/puckel/docker-airflow. For deployment, they use Ansible. There are recipes.
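A small sketch of that idea, again without Airflow itself: the first task writes its output to storage that every machine can reach and hands over only a short reference to it. The task functions, the file naming and the run id are hypothetical, and a temporary directory stands in for the shared storage (a network share or object store, for example).

```python
import json
import tempfile
from pathlib import Path


def extract_task(shared_dir, run_id):
    """Write the extracted rows to shared storage and return only the path."""
    rows = [{"id": 1, "value": 10}, {"id": 2, "value": 32}]  # made-up data
    path = Path(shared_dir) / f"extract_{run_id}.json"
    path.write_text(json.dumps(rows))
    return str(path)  # a short reference, not the data itself


def transform_task(input_path):
    """Read the rows back from shared storage and sum the values."""
    rows = json.loads(Path(input_path).read_text())
    return sum(row["value"] for row in rows)


# The temporary directory plays the role of storage reachable from every machine.
with tempfile.TemporaryDirectory() as shared_dir:
    reference = extract_task(shared_dir, run_id="example-run")
    total = transform_task(reference)
```

The only thing passed between the two tasks is the path string. That is the kind of small message the limited task-to-task communication is meant for; the actual data stays in the shared location.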
Photo explanation: simply a picture from my train trip (with a nice planned detour through the Eifel) from Utrecht (NL) to Karlsruhe (DE). ICE to Köln entering Utrecht central station.
My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.