(One of my summaries of the 2025 pycon.de conference in Darmstadt, DE).
Barak is a founding engineer at lakeFS.
Local storage is simple and easy. Cloud storage is really handy (scalability, security, etc.), but the flexibility comes at a price: every cloud storage service introduced its own way of working with it, each with a slightly different API.
fsspec is a python library providing a unified interface for interacting with various storage systems, local and remote. The goal is to make remote systems work as local ones.
Your python code talks to the “fsspec unified interface”, which accepts a file system identifier/type (like s3), which activates that filesystem functionality, and then a file path within that type of filesystem.
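A minimal sketch of that identifier-plus-path idea, using fsspec's built-in in-memory filesystem so it runs without cloud credentials. The path is made up for illustration; swapping the prefix for s3:// or gs:// should select another backend, assuming the matching package (s3fs, gcsfs) is installed.

```python
import fsspec

# "memory://" picks the in-memory backend; the rest is the path inside it.
url = "memory://demo/hello.txt"  # hypothetical path, for illustration only

# Writing and reading look the same as with the builtin open().
with fsspec.open(url, "w") as f:
    f.write("hello from fsspec\n")

with fsspec.open(url, "r") as f:
    content = f.read()

# The identifier alone can also be resolved to a filesystem object.
fs = fsspec.filesystem("memory")
exists = fs.exists(url)
```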
Why fsspec?
It simplifies your code.
Consistency.
Enhanced capabilities.
Ecosystem integration.
Extensible and open source.
fsspec implements the standard python .read(), .write(), .glob() etc. And also .seek(...), which you can use to do range requests, something you’d have to do yourself with s3/boto otherwise.
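A small sketch of reading only part of a file. It uses the in-memory backend and a made-up path, so no real range request goes over the network here; the assumption is that on a remote backend the same seek-then-read pattern is what gets turned into a ranged read.

```python
import fsspec

url = "memory://demo/data.bin"  # hypothetical path, for illustration only

with fsspec.open(url, "wb") as f:
    f.write(b"0123456789abcdef")

# Jump to byte offset 10 and read four bytes instead of the whole file.
with fsspec.open(url, "rb") as f:
    f.seek(10)
    tail = f.read(4)

# The filesystem object offers the same thing as a one-liner.
fs = fsspec.filesystem("memory")
same_tail = fs.cat_file(url, start=10, end=14)
```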
Pandas can read files from s3 and the like, but it needs extra libraries for that. You can also open the file with fsspec and pass the file object to pandas.
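Passing an fsspec file object to pandas could look roughly like this. The CSV content and path are invented for the example, and the in-memory backend stands in for s3.

```python
import fsspec
import pandas as pd

url = "memory://demo/people.csv"  # hypothetical path and data

with fsspec.open(url, "w") as f:
    f.write("name,age\nAda,36\nLinus,28\n")

# pandas only sees an ordinary file-like object.
with fsspec.open(url, "r") as f:
    df = pd.read_csv(f)
```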
fsspec has some additional capabilities. For instance caching. By prepending simplecache:: in front of the regular connection string, fsspec will cache the file if you read it multiple times.
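A sketch of the simplecache:: prefix. To keep it runnable offline, a local temporary file plays the role of the "remote" file; the file names and cache directory are assumptions made for the example.

```python
import os
import tempfile

import fsspec

# A local file stands in for the remote object (illustration only).
src_dir = tempfile.mkdtemp()
src = os.path.join(src_dir, "data.txt")
with open(src, "w") as f:
    f.write("cached content\n")

cache_dir = tempfile.mkdtemp()  # where the cached copy should end up
url = f"simplecache::file://{src}"

# Options for the caching layer are passed under its protocol name.
with fsspec.open(url, "r", simplecache={"cache_storage": cache_dir}) as f:
    first = f.read()

# The second read is expected to be served from the copy in cache_dir.
with fsspec.open(url, "r", simplecache={"cache_storage": cache_dir}) as f:
    second = f.read()

cached_files = os.listdir(cache_dir)
```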
What if you want to use a file system that isn’t supported (yet)? You can implement your own! Inherit from AbstractFileSystem and implement the necessary methods. He showed a funny toy example.
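This is not the toy example from the talk, just a minimal sketch of my own: a read-only filesystem that serves bytes from a dict. The class name, the "dictfs" protocol name and the choice to implement only ls and _open are all assumptions; a real backend would need more methods.

```python
import io

from fsspec.spec import AbstractFileSystem


class DictFileSystem(AbstractFileSystem):
    """Hypothetical read-only filesystem backed by a dict of name -> bytes."""

    protocol = "dictfs"  # made-up protocol name
    cachable = False  # skip fsspec's instance cache for this toy class

    def __init__(self, files, **kwargs):
        super().__init__(**kwargs)
        self.files = dict(files)

    def _entry(self, name):
        return {"name": name, "size": len(self.files[name]), "type": "file"}

    def ls(self, path, detail=True, **kwargs):
        path = self._strip_protocol(path)
        if path == "":
            # The root lists every file; there are no subdirectories.
            entries = [self._entry(name) for name in sorted(self.files)]
        elif path in self.files:
            entries = [self._entry(path)]
        else:
            raise FileNotFoundError(path)
        return entries if detail else [e["name"] for e in entries]

    def _open(self, path, mode="rb", **kwargs):
        if mode != "rb":
            raise NotImplementedError("this toy filesystem is read-only")
        return io.BytesIO(self.files[self._strip_protocol(path)])


fs = DictFileSystem({"greeting.txt": b"hi there\n", "numbers.txt": b"1 2 3\n"})
names = fs.ls("", detail=False)
with fs.open("greeting.txt", "r") as f:
    greeting = f.read()
raw = fs.cat_file("numbers.txt")
```

The inherited open() takes care of text mode, so the subclass only has to hand back a binary file object.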
He works for lakeFS and showed lakefs-spec, a lakeFS plugin for fsspec: it wasn’t built by them, but contributed by the community! Nice.
LakeFS works pretty much like git does: cheap branches, versioning, pointers to full files, easy diffs, etc. The actual files are stored in cloud storage; lakeFS manages the pointers to the files.
LakeFS wants to be atomic. But the cloud providers don’t provide multi-file guarantees. Files will show up, but there might be a second or so delay. So they solve it by adding multiple files to a temporary branch and only merging it once all files are really there. You can configure whether you want to retain the temporary branch in case of an error (for debugging purposes) or have it removed automatically. He showed a convincing demo.
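I don't have the lakefs-spec code from the demo, so here is only a sketch of the generic fsspec transaction mechanism, on the local filesystem. It shows the same "nothing is visible until everything is written" idea; the temporary-branch-and-merge part is lakeFS-specific and is not reproduced here. File names are made up.

```python
import os
import tempfile

import fsspec

fs = fsspec.filesystem("file")  # plain local filesystem
target_dir = tempfile.mkdtemp()
paths = [os.path.join(target_dir, name) for name in ("a.txt", "b.txt")]

with fs.transaction:
    for path in paths:
        with fs.open(path, "wb") as f:
            f.write(b"part of one batch\n")
    # Inside the transaction the files are not at their final location yet.
    visible_during = [os.path.exists(p) for p in paths]

# Leaving the block commits all files together.
visible_after = [os.path.exists(p) for p in paths]
```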
Cloud storage: necessary, but challenging.
fsspec simplifies it with a unified python interface.
fsspec is extensible and open source.
fsspec integrates nicely and enables caching, transactions and more.
Photo explanation: picture from our 2024 vacation around Kassel (DE)
My name is Reinout van Rees and I program in Python, I live in the Netherlands, I cycle recumbent bikes and I have a model railway.