The Daft team is excited to announce that we now support reading from Delta Lake!
We released a blog post on the Delta Lake blog, but here’s the TL;DR.
What is Delta Lake?
Delta Lake is an open table format that provides an abstraction over a table of data. Under the hood, the table is stored as Parquet files (which Daft reads extremely efficiently), and Delta Lake keeps a transaction log of metadata about those files, allowing users to efficiently query tables at terabyte scale!
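Here’s a minimal sketch of what this looks like in Daft. The table path is illustrative, and we’re assuming the reader is exposed as daft.read_deltalake (check the Daft docs for the exact name in your version):

```python
import daft

# Load a Delta Lake table as a lazy Daft DataFrame.
# "path/to/delta_table" is an illustrative local path; cloud storage
# URIs (e.g. s3://...) are also supported.
df = daft.read_deltalake("path/to/delta_table")

# Nothing has been read yet -- Daft builds a lazy query plan and only
# scans the Parquet files it actually needs.
df.show()
```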
What’s Cool?
In the blog post, we run benchmarks locally against other local engines (Polars, pandas, and DuckDB). Daft’s Delta Lake integration allows it to outperform pandas by 15.8x, DuckDB by 2.3x, and Polars by 2x on partitioned and z-ordered Delta Lake tables.
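Much of that speedup comes from skipping work: Delta Lake records per-file metadata (partition values and column statistics), so a filter on a partition column lets Daft prune entire Parquet files before reading them. A hedged sketch, with a hypothetical table partitioned by a country column:

```python
import daft

# Hypothetical Delta Lake table partitioned by "country".
df = daft.read_deltalake("path/to/events_table")

# The filter is pushed down into the scan: files whose partition values
# can't match are skipped using Delta Lake's metadata, so only a
# fraction of the table is ever read.
filtered = df.where(df["country"] == "CA")
filtered.show()
```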
What’s Coming Up?
We have some exciting new features around Delta Lake in the pipeline too. These include:
Read support for deletion vectors (issue)
Read support for column mappings (issue)
Writing new Delta Lake tables (issue, in-progress PR)
Writing back to an existing table with appends, overwrites, upserts, or deletes (issue)
P.S. If you’re interested in exploring the intersection of modern data and ML stacks, our team is hiring! :)