Pycon.de: open table formats - Franz Wöllert

Tags: pycon, python

(One of my summaries of the 2025 pycon.de conference in Darmstadt, DE).

Open table formats have changed the data landscape for the better over the last few years.

It all begins with data: all our advancements as human beings depend on it. Data fuels knowledge creation. What would we do without books, for instance?

He works for a company that builds big printing machines (Heidelberg Druckmaschinen). Those machines produce lots of monitoring data. In 2015 they used spark, hadoop and cassandra as their big data platform, which was state-of-the-art at the time. But… they’re hoping to finally shut it down this year. It was expensive, difficult to maintain and limited in scalability (at least in the way they set it up).

They started using the cloud. AWS, google cloud and azure promise a lot: scalability, servers optimised for different use-cases, etcetera. But moving from your own hadoop instance to the cloud isn’t easy. Snowflake and databricks are data platform giants that promise to take a lot of this kind of work off your hands.

One of the apache techniques they still use is apache parquet because it has strict types (int, string, bool, float). Those strict types help a lot with testing. Another advantage: efficient encodings.
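A minimal sketch of what those strict types look like with pyarrow (the column names are made-up monitoring fields, not from the talk):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Explicit, strict schema: a value of the wrong type fails immediately
# instead of silently ending up as a mixed-type column.
schema = pa.schema([
    ("machine_id", pa.string()),
    ("temperature", pa.float64()),
    ("error_count", pa.int64()),
    ("is_running", pa.bool_()),
])

table = pa.Table.from_pylist(
    [
        {"machine_id": "press-01", "temperature": 72.5, "error_count": 0, "is_running": True},
        {"machine_id": "press-02", "temperature": 68.1, "error_count": 3, "is_running": False},
    ],
    schema=schema,
)

# Parquet stores the types plus efficient per-column encodings
# (dictionary, run-length, etc.).
pq.write_table(table, "monitoring.parquet")
print(pq.read_schema("monitoring.parquet"))
```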

Best advantage: skipping irrelevant data. They partition the data and store metadata (like min/max values) per part. That lets you skip reading data you know you don’t need anyway.
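Roughly what that looks like from Python, again with pyarrow and made-up partition/column names. The filter is pushed down, so partitions and row groups whose min/max statistics can’t match are never read:

```python
import pyarrow.dataset as ds

# Hypothetical partitioned dataset: one directory per machine_id
# ("hive"-style partitioning).
dataset = ds.dataset("monitoring/", format="parquet", partitioning="hive")

# Only matching partitions and row groups are actually read.
hot = dataset.to_table(
    filter=(ds.field("machine_id") == "press-01") & (ds.field("temperature") > 80.0)
)
```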

But there are drawbacks like missing ACID support.

But… there are “open table format” tools like iceberg and delta lake on top of parquet that provide such missing functionality: ACID/transactions, schema evolution, handy partitioning, versioning/time-travel and more. Something they’re looking at is liquid clustering, a more automatic/flexible partitioning mechanism (at least, that’s the promise, but they encountered problems).
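A small sketch of the transactions and time travel with the deltalake Python package (delta-rs), using a hypothetical table path rather than anything from the talk:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"machine_id": ["press-01"], "temperature": [72.5]})

# Every write is an atomic transaction recorded in the table's log.
write_deltalake("monitoring_delta", df)
write_deltalake("monitoring_delta", df, mode="append")

table = DeltaTable("monitoring_delta")
print(table.version())   # current table version
print(table.history())   # the transaction log entries

# Time travel: load the table as it was at an earlier version.
old = DeltaTable("monitoring_delta", version=0).to_pandas()
```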

Reading data out of delta lake was a bit of a problem. They had to remove some features from their tables before tools could actually read the data: you have to watch out which features your table uses versus what your tool supports.
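I haven’t checked this against their exact setup, but with the deltalake package you can at least inspect which protocol versions (and, in newer releases, which named table features) a table requires before trying to read it:

```python
from deltalake import DeltaTable

# Hypothetical table path; not from the talk.
table = DeltaTable("monitoring_delta")

# The protocol records what reader/writer support the table requires;
# a client that doesn't support it will refuse to read the table.
protocol = table.protocol()
print(protocol.min_reader_version, protocol.min_writer_version)
```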

Delta lake is open source, but most of the contributors work for Databricks, whose own platform is closed source. Apache iceberg, an alternative, has a much more diverse contributor base. From a community perspective, iceberg has an edge.

In the end, open table formats make data engineering life better. Having your data stored in a proprietary format at a commercial company isn’t ideal: it is great that there are open formats now.

https://reinout.vanrees.org/images/2025/pycon-13.jpeg

Photo explanation: picture from the recent “on traxs” model railway exhibition in Utrecht (NL)

 