Delta Connectors 0.3.0 Released
We are excited to announce the release of Delta Connectors 0.3.0, which introduces support for writing Delta tables. The key features in this release are:
Delta Standalone
- Write functionality - This release introduces new APIs to support creating and writing Delta tables without Apache Spark™. External processing engines can write Parquet data files themselves and then use the APIs to add the files to the Delta table atomically. Following the Delta Transaction Log Protocol, the implementation uses optimistic concurrency control to manage multiple writers, automatically generates checkpoint files, and manages log and checkpoint cleanup according to the protocol. The main Java class exposed is `OptimisticTransaction`, which is accessed via `DeltaLog.startTransaction()`; a usage sketch appears after this list.
  - `OptimisticTransaction.markFilesAsRead(readPredicates)` must be used to read all metadata during the transaction (and not the `DeltaLog`). It is used to detect concurrent updates and determine whether logical conflicts between this transaction and previously committed transactions can be resolved.
  - `OptimisticTransaction.commit(actions, operation, engineInfo)` is used to commit changes to the table. If a conflicting transaction has been committed first (see above), an exception is thrown; otherwise, the table version that was committed is returned.
  - Idempotent writes can be implemented using `OptimisticTransaction.txnVersion(appId)` to check for version increases committed by the same application.
  - Each commit must specify the `Operation` being performed by the transaction.
  - Transactional guarantees for concurrent writes on Microsoft Azure and Amazon S3. This release includes custom extensions to support concurrent writes on Azure and S3 storage systems, which on their own lack the necessary atomicity and durability guarantees. Please note that transactional guarantees are only provided for concurrent writes on S3 from a single cluster.
- Memory-optimized iterator implementation for reading files in a snapshot - `DeltaScan` introduces an iterator implementation for reading the `AddFile`s in a snapshot with support for partition pruning. It can be accessed via `Snapshot.scan()` or `Snapshot.scan(predicate)`, the latter of which filters files based on the `predicate` and any partition columns in the file metadata. This API significantly reduces the memory footprint when reading the files in a `Snapshot` and, because the iterator is also used internally, when instantiating a `DeltaLog`. See the scan sketch after this list.
- Partition filtering for metadata reads and conflict detection in writes - This release introduces a simple expression framework for partition pruning in metadata queries. When reading files in a snapshot, filter the returned `AddFile`s on partition columns by passing a `predicate` into `Snapshot.scan(predicate)`. When updating a table during a transaction, specify which partitions were read by passing a `readPredicate` into `OptimisticTransaction.markFilesAsRead(readPredicate)` to detect logical conflicts and avoid transaction conflicts when possible (see the scan sketch after this list).
- Miscellaneous updates:
  - `ParquetSchemaConverter` converts a `StructType` schema to a Parquet schema.
  - `Iterator<VersionLog> DeltaLog.getChanges()` exposes an incremental metadata changes API; `VersionLog` wraps the version number and the list of actions in that version. See the changes sketch after this list.
  - Fix #197 for `RowRecord` so that values in partition columns can be read.
  - Miscellaneous bug fixes.
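To make the write path concrete, here is a minimal sketch of an idempotent blind append with Delta Standalone. The table path, application id, batch version, file metadata, and engine string are invented for illustration, and the exact constructor arguments for `AddFile` and `SetTransaction` should be checked against the 0.3.0 API docs:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

import org.apache.hadoop.conf.Configuration;

import io.delta.standalone.DeltaLog;
import io.delta.standalone.Operation;
import io.delta.standalone.OptimisticTransaction;
import io.delta.standalone.actions.Action;
import io.delta.standalone.actions.AddFile;
import io.delta.standalone.actions.SetTransaction;

public class AppendExample {
    public static void main(String[] args) {
        // Hypothetical table path and application id, for illustration only.
        DeltaLog log = DeltaLog.forTable(new Configuration(), "/tmp/delta-table");
        OptimisticTransaction txn = log.startTransaction();

        // Idempotent writes: skip the commit if this application has already
        // committed this (hypothetical) batch version or a later one.
        String appId = "my-engine-app";
        long batchVersion = 7;
        if (txn.txnVersion(appId) >= batchVersion) {
            return; // already committed by a previous run
        }

        // The engine wrote this Parquet file itself; the commit only records
        // it in the Delta log. Sizes, times, and partition values are made up.
        AddFile file = new AddFile(
                "date=2021-09-08/part-00000.snappy.parquet",    // path relative to table root
                Collections.singletonMap("date", "2021-09-08"), // partition values
                1024L,                                          // size in bytes
                System.currentTimeMillis(),                     // modification time
                true,                                           // dataChange
                null,                                           // stats (optional)
                null);                                          // tags (optional)

        List<Action> actions = Arrays.asList(
                file,
                // Record this application's transaction version for idempotency.
                new SetTransaction(appId, batchVersion,
                        Optional.of(System.currentTimeMillis())));

        // Each commit names the Operation being performed; if a conflicting
        // transaction committed first, this throws an exception.
        txn.commit(actions, new Operation(Operation.Name.WRITE), "MyEngine/0.1.0");
    }
}
```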
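The snapshot-scan and partition-filtering features compose naturally. Below is a sketch assuming a table partitioned by a hypothetical `date` column, with the predicate built from the new expression framework in `io.delta.standalone.expressions`:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;

import io.delta.standalone.DeltaLog;
import io.delta.standalone.DeltaScan;
import io.delta.standalone.Snapshot;
import io.delta.standalone.actions.AddFile;
import io.delta.standalone.data.CloseableIterator;
import io.delta.standalone.expressions.EqualTo;
import io.delta.standalone.expressions.Expression;
import io.delta.standalone.expressions.Literal;
import io.delta.standalone.types.StructType;

public class ScanExample {
    public static void main(String[] args) throws IOException {
        DeltaLog log = DeltaLog.forTable(new Configuration(), "/tmp/delta-table");
        Snapshot snapshot = log.snapshot();

        // Build a partition predicate; "date" is a hypothetical partition column.
        StructType schema = snapshot.getMetadata().getSchema();
        Expression predicate =
                new EqualTo(schema.column("date"), Literal.of("2021-09-08"));

        // Files are streamed through an iterator rather than materialized as
        // one large in-memory list; non-matching partitions are pruned.
        DeltaScan scan = snapshot.scan(predicate);
        try (CloseableIterator<AddFile> files = scan.getFiles()) {
            while (files.hasNext()) {
                System.out.println(files.next().getPath());
            }
        }

        // Inside a transaction, the same predicate would be passed to
        // markFilesAsRead so conflicts with concurrent writers touching the
        // same partitions can be detected at commit time:
        // DeltaScan txnScan = log.startTransaction().markFilesAsRead(predicate);
    }
}
```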
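Finally, a sketch of replaying a table's history incrementally with `DeltaLog.getChanges()`. The parameters shown here (a start version and a fail-on-data-loss flag) are an assumption on our part; consult the API docs for the exact signature:

```java
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;

import io.delta.standalone.DeltaLog;
import io.delta.standalone.VersionLog;

public class ChangesExample {
    public static void main(String[] args) {
        DeltaLog log = DeltaLog.forTable(new Configuration(), "/tmp/delta-table");

        // Walk the table's metadata changes one version at a time.
        Iterator<VersionLog> changes = log.getChanges(0, false);
        while (changes.hasNext()) {
            VersionLog entry = changes.next();
            System.out.println("version " + entry.getVersion()
                    + ": " + entry.getActions().size() + " actions");
        }
    }
}
```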
Delta Connectors
- Hive 3 support for the Hive Connector
- Microsoft PowerBI connector for reading Delta tables natively - Read Delta tables directly from PowerBI from any storage system supported by PowerBI without running a Spark cluster. Features include online/scheduled refresh in the PowerBI service, support for Delta Lake time travel (e.g. `VERSION AS OF`), and partition elimination using the partition schema of the Delta table. For more details see the dedicated README.md.
Credits
Alex, Allison Portis, Denny Lee, Gerhard Brueckl, Pawel Kubit, Scott Sandre, Shixiong Zhu, Wang Wei, Yann Byron, Yuhong Chen, gurunath
Visit the release notes to learn more about the release.