Delta Lake 4.0.0
We are excited to announce the final release of Delta Lake 4.0.0! This release includes several exciting new features.
Highlights
- [Spark] Preview support for catalog-managed tables, a new table feature that transforms Delta Lake into a catalog-oriented lakehouse table format. This feature is still in the RFC stage, and as such, the protocol is still under development and is subject to change.
- [Spark] Delta Connect is an extension for Spark Connect which enables the usage of Delta over Spark Connect, allowing Delta to be used with the decoupled client-server architecture of Spark Connect.
- [Spark] Support for the Variant data type to enable semi-structured storage and data processing, for flexibility and performance.
- [Spark] Support a new DROP FEATURE implementation that allows dropping table features instantly without truncating history.
- [Kernel] Support for reading and writing version checksum.
- [Kernel] Support reading log compaction files for better performance during snapshot construction, and support writing log compaction files as a post commit hook.
- [Kernel] Support for the Clustered Table feature which enables defining and updating the clustering columns on a table.
- [Kernel] Support for writing to row tracking enabled tables.
- [Kernel] Support for writing file statistics to the Delta log when they are provided by the engine. This enables data skipping using query filters at read time.
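The statistics-based data skipping mentioned above works by comparing query filters against per-file statistics recorded in the Delta log. A minimal pure-Python sketch of the idea, with made-up file names and min/max values rather than a real Delta log:

```python
# Minimal sketch of min/max data skipping, as enabled when file statistics
# are written to the Delta log. File names and stats are illustrative only.

# Per-file statistics for a column "id": (min, max), as a reader might
# collect them from the statistics attached to each file in the log.
file_stats = {
    "part-000.parquet": (0, 99),
    "part-001.parquet": (100, 199),
    "part-002.parquet": (200, 299),
}

def files_for_range(stats, lo, hi):
    """Keep only files whose [min, max] range overlaps the filter [lo, hi]."""
    return [f for f, (mn, mx) in stats.items() if mx >= lo and mn <= hi]

# A query like `WHERE id BETWEEN 150 AND 250` only needs to read two files.
print(files_for_range(file_stats, 150, 250))
# -> ['part-001.parquet', 'part-002.parquet']
```

Files whose statistics cannot possibly match the filter are skipped entirely, which is why writing statistics at commit time pays off at read time.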
Details by each component
Sunset of Delta Standalone and dependent connectors
Delta Standalone and its dependent connectors, including the Delta Flink and Delta Hive connectors, are no longer under active development. Starting with Delta 4.0, we will not release these projects as part of the 4.x Delta releases. These connectors are in maintenance mode and, going forward, will receive only critical security fixes and high-severity bug patches in the 3.x series. We are committed to a full transition from Delta Standalone to Delta Kernel and to a future Kernel-based Flink connector.
Delta Spark
Delta Spark 4.0 is built on Apache Spark™ 4.0. Similar to Apache Spark, we have released Maven artifacts for Scala 2.13.
- Documentation: https://docs.delta.io/4.0.0/index.html
- API documentation: https://docs.delta.io/4.0.0/delta-apidoc.html#delta-spark
- Maven artifacts: delta-spark_2.13, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb, delta-connect-client_2.13, delta-connect-common_2.13
The key features of this release are:
- Delta Connect adds Spark Connect support to the Scala and Python APIs of Delta Lake for Apache Spark. Spark Connect is a decoupled client-server architecture that allows remote connectivity to Spark from everywhere. Delta Connect makes the DeltaTable interfaces compatible with the new Spark Connect protocol. For more information on how to use Delta Connect, see the Delta Connect documentation. Delta Connect is currently in preview.
- Preview support for catalog-managed tables: Delta Spark now supports reading from and writing to tables that have the `catalogOwned-preview` table feature enabled. This feature allows a catalog to broker all commits to the tables it manages, giving the catalog the control and visibility it needs to prevent invalid operations (e.g. commits that violate foreign key constraints), enforce security and access controls, and open the door for future performance optimizations. Currently, write support includes `INSERT`, `MERGE INTO`, `UPDATE`, and `DELETE` operations.
  - Note: this feature is still in the RFC stage, and as such, the protocol is still under development and subject to change. The `catalogOwned-preview` feature should not be enabled for production tables, and tables created with this preview feature enabled may not be compatible with future Delta Spark releases.
- Support for the Variant data type: Variant is a new Apache Spark data type that enables flexible and efficient processing of semi-structured data without a user-specified schema. Variant data does not require a fixed schema on write; instead, it is queried using a schema-on-read approach. This allows flexible ingestion, since no write schema is required, and enables faster processing than querying the data as JSON strings. This feature was originally released in preview; as of 4.0.0 it is no longer in preview.
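Schema-on-read, the model behind Variant, can be sketched with the Python standard library's `json` module. Note this is only a conceptual illustration: Variant stores an efficient binary encoding, not JSON strings, and the helper below is hypothetical, not a Delta or Spark API.

```python
import json

# Schema-on-read sketch: records are ingested without a declared schema,
# and fields are extracted only when a query asks for them.
rows = [
    '{"user": {"id": 1, "tags": ["a", "b"]}}',
    '{"user": {"id": 2}, "extra": true}',   # different shape, still ingestible
]

def extract(row, *path):
    """Walk a field path, returning None where the field is absent."""
    value = json.loads(row)
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

print([extract(r, "user", "id") for r in rows])  # -> [1, 2]
print([extract(r, "extra") for r in rows])       # -> [None, True]
```

Because no write schema is declared, records of differing shapes land in the same column, and missing fields simply come back empty at query time.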
Other notable changes include:
- Support dropping table features using the DeltaTable Scala/Python APIs with `deltaTable.dropFeatureSupport`.
- Support dropping the `deletionVectors` table feature.
- Support DataFrameReader options to unblock non-additive schema changes when streaming.
- Invariant checks for DML commands to detect potential bugs in Delta or Spark earlier during execution and prevent committing the transaction in these cases.
- Support the `timestampdiff` and `timestampadd` expressions for generated columns.
- Support sorting within partitions when Z-ordering. This can be enabled using the Spark conf `spark.databricks.io.skipping.mdc.sortWithinPartitions` (disabled by default) to improve data skipping at the Parquet level.
- Miscellaneous bug fixes:
  - Fix to resolve struct fields by name instead of by position for structs nested inside map types during an update.
Delta Kernel Java
- API documentation: https://docs.delta.io/4.0.0/api/java/kernel/index.html
- Maven artifacts: delta-kernel-api, delta-kernel-defaults
The Delta Kernel project is a set of Java and Rust libraries for building Delta connectors that can read and write to Delta tables without the need to understand the Delta protocol details.
The key features of this release are:
- Support loading and writing the version checksum: Java Kernel now supports loading and writing a version checksum for every table commit via a post-commit hook. The checksum records detailed metrics such as file counts, table size, and data-distribution histograms, bringing stronger consistency guarantees and better debugging tools to your data ecosystem. Kernel also uses the checksum to retrieve the protocol and metadata actions without reading multiple log files, reducing snapshot-initialization latency.
- Support reading log-compaction files when reading the delta log during log-replay. This provides a speedup for Snapshot construction and therefore benefits any processes that require creating a snapshot, like scanning or writing to a table.
- Support writing log compaction files as a post-commit hook. If the table is in a state that requires a compaction file to be created, this hook is returned from the transaction commit; invoking the hook builds and writes the compaction file. The interval between compactions can be set on the `TransactionBuilder` by calling `TransactionBuilder.withLogCompactionInterval`.
- Support the Clustered Table table feature. This enables Kernel to define and update the clustering columns on a table, making clustering information available for Delta clustering implementations. Users can now create a clustered table or update its existing clustering columns through Kernel.
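The checksum and log-compaction features above both add files to the `_delta_log` directory alongside the per-commit JSON files. A small pure-Python sketch of the naming layout, assuming the 20-digit zero-padded version numbers Delta log files use; treat the exact names as illustrative rather than as a normative spec:

```python
# Sketch of _delta_log file naming for commits, version checksums, and
# log-compaction files. Versions are zero-padded to 20 digits; the exact
# suffixes here are illustrative of the layout, not an API.

def commit_file(version: int) -> str:
    # One JSON file of actions per commit.
    return f"{version:020d}.json"

def checksum_file(version: int) -> str:
    # Per-version checksum written by the post-commit hook.
    return f"{version:020d}.crc"

def compaction_file(start: int, end: int) -> str:
    # One file covering commits start..end, so log replay can read it
    # instead of each individual commit file in that range.
    return f"{start:020d}.{end:020d}.compacted.json"

print(commit_file(11))          # -> 00000000000000000011.json
print(compaction_file(10, 19))  # -> 00000000000000000010.00000000000000000019.compacted.json
```

During snapshot construction, replaying one compaction file in place of many commit files is what yields the speedup described above.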
Other notable changes include:
- Additional functionality in `Transaction` and `TransactionBuilder`:
  - Support adding and removing Domain Metadata during a transaction.
  - Optimization to avoid loading the existing domain metadata from the snapshot when no domain metadata is removed in a transaction.
  - Support setting and unsetting user-facing table properties with `txnBuilder.withTableProperties` and `txnBuilder.withTablePropertiesRemoved`.
  - Support getting the read table version on the Kernel Transaction API with `getReadTableVersion`.
  - Support configuring the number of retries Kernel attempts when encountering a concurrent transaction during commit.
Note: basic schema evolution support via providing an updated schema to the txnBuilder.withSchema method is close to completion and just missed the code cutoff for this release. Look out for this exciting change soon!
Delta Sharing
In this release of Delta Sharing Spark we have upgraded delta-sharing-client from 1.2.2 to 1.3.2. This enables the following changes:
- Upgraded to Spark 4.0.0: platform upgrades update Spark to version 4.0.0, Java to 17, and Scala to 2.13.
- Optimized cache usage for improved performance: Simplified key structures of the Spark Parquet IO cache, which enables cache reuse in the Spark Parquet IO layer for identical queries to speed up performance.
- Enhanced logging and error propagation for better observability
- Added detailed logging to critical Delta Sharing client code paths to aid debugging. This will help with identifying the root cause of client side exceptions.
- Improved error propagation by surfacing server-side error messages to the client in streaming query failure scenarios.
Limitations
In Delta Spark, UniForm with Iceberg is currently unavailable because Iceberg does not yet support Spark 4.0. This will be enabled in a future release.
Credits
Ada Ma, Ala Luszczak, Alexey Shishkin, Allison Portis, Ami Oka, Amogh Jahagirdar, Andreas Chatzistergiou, Andrei Tserakhau, Andy Lam, Anoop Johnson, Anton Erofeev, Anurag Vaibhav, Bilal Akhtar, Carmen Kwan, Charlene Lyu, ChengJi-db, Chirag Singh, Christos Stavrakakis, Cuong Nguyen, Dhruv Arya, Dušan Tišma, Felipe Pessoto, FredLiu, Gene Pang, Hao Jiang, Harsh Motwani, Herman van Hovell, Jiaheng Tang, Johan Lasperas, Juliusz Sompolski, Jun, Kaiqi Jin, Lars Kroll, Lin Zhou, Livia Zhu, Lukas Rupprecht, Malte Sølvsten Velin, Marko Ilić, Ming Dai, Nick Lanham, Ole Sasse, Omar Elhadidy, Oussama Saoudi, Paddy Xu, Phil Plato, Qiyuan Dong, Rahul Shivu Mahadev, Rajesh Parangi, Rakesh Veeramacheneni, Scott Sandre, Slava Min, Stefan Kandic, Sumeet Varma, Thang Long Vu, Tom van Bussel, Venkata Sai Akhil Gudesa, Venki Korukanti, Vladimir Golubev, Wei Luo, Wenchen Fan, Xiaochong Wu, Xin Huang, Yumingxuan Guo, Ze'ev Maor, Zhipeng Mao, Zihao Xu, Ziya Mukhtarov, chenjian2664, emkornfield, jackierwzhang, kamcheungting-db, littlegrasscao, mozasaur, richardc-db