ETL? Extract, Transform, Load. It goes hand-in-hand with traditional data warehousing. But that's only the traditional sense: you can also see it as generic data transformation. The principles of ETL-ing apply there, too.
ETL often implies batch processing. You could do it in a streaming fashion, but most people still run batches.
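To make the terminology concrete, here is a minimal batch-style ETL sketch in plain Python. This is my own illustration, not code from the talk; all names and the sample data are made up.

```python
# Minimal batch ETL sketch (illustration only): extract raw records,
# transform/clean them, load them into a target.

def extract():
    # Stand-in for reading from a source system (file, database, API).
    return [{"name": " Alice ", "amount": "10"},
            {"name": "Bob", "amount": "32"}]

def transform(rows):
    # Clean and convert every record.
    return [{"name": r["name"].strip(), "amount": int(r["amount"])}
            for r in rows]

def load(rows, target):
    # Stand-in for writing to a data warehouse table.
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
```

A "batch" here just means: the whole extract/transform/load pipeline runs over a finite chunk of data on a schedule, rather than continuously on a stream.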
There are lots of commercial ETL tools. They have problems, however.
She suggests using Python instead:
Airflow is a Python platform to programmatically author, schedule and monitor workflows. It combines the user-friendliness of commercial ETL tools with the flexibility of Python.
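To give an idea of the "programmatically author" part, here is a rough sketch of what an Airflow DAG file looks like. This is my own hedged example, not from the talk: the DAG name, schedule and commands are made up, and import paths differ between Airflow versions (this is the 1.x style).

```python
# Sketch of an Airflow DAG definition (names and schedule are made up).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="nightly_etl",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",  # the scheduler triggers one run per day
)

extract = BashOperator(task_id="extract",
                       bash_command="echo extract", dag=dag)
transform = BashOperator(task_id="transform",
                         bash_command="echo transform", dag=dag)
load = BashOperator(task_id="load",
                    bash_command="echo load", dag=dag)

# Task order: extract, then transform, then load.
extract >> transform >> load
```

Because a DAG is just Python code, you get version control, tests and code reuse for free, which is a big part of the flexibility argument.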
Understanding Airflow means understanding the scheduler. (This part of the presentation went too fast for me, so: "read the documentation". She mentioned an article; I think it is this one.)
Passing data from one task to another is an Airflow anti-pattern. Tasks should be executable on different machines, so data should be accessible from every machine: you don't want to have to start a task on the machine where the previous stage ran. Some limited communication is possible via xcom.
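As a hedged illustration of what "limited communication via xcom" means (again my own sketch, Airflow 1.x style, all names made up): you push a small value such as a file path, and the next task pulls it. The actual data stays in shared storage; only the reference travels through xcom.

```python
# Sketch of xcom for small messages between tasks (Airflow 1.x style;
# DAG and task names are made up). Pass only tiny values this way,
# never real datasets: store those externally and pass a reference.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(dag_id="xcom_example", start_date=datetime(2017, 1, 1))

def produce(**context):
    # Push a small value, e.g. the path of the file this task wrote.
    context["ti"].xcom_push(key="output_path",
                            value="/data/2017-01-01.csv")

def consume(**context):
    path = context["ti"].xcom_pull(task_ids="produce", key="output_path")
    print("reading", path)

produce_task = PythonOperator(task_id="produce", python_callable=produce,
                              provide_context=True, dag=dag)
consume_task = PythonOperator(task_id="consume", python_callable=consume,
                              provide_context=True, dag=dag)
produce_task >> consume_task
```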
The easiest way to start experimenting? Use Docker: https://github.com/puckel/docker-airflow . For deployment, they use Ansible; there are recipes.
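If I read the repository's README correctly, getting a local instance up is roughly the following (check the README itself for current usage, as the image and its commands may have changed):

```shell
# Pull the community Airflow image and start the webserver on port 8080.
docker pull puckel/docker-airflow
docker run -d -p 8080:8080 puckel/docker-airflow webserver
# Then browse to http://localhost:8080 for the Airflow UI.
```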
Photo explanation: simply a picture from my train trip (with a nice planned detour through the Eifel) from Utrecht (NL) to Karlsruhe (DE). ICE to Köln entering Utrecht central station.