Reputation: 237
I know ACID transactions are one of the important features of Delta Lake for reads and writes. Is this also true for merge operations? What if two pipelines try to update the same record based on different conditions? Can that cause data inconsistency?
Upvotes: 0
Views: 1319
Reputation: 26
Well, it depends.
Delta Lake uses optimistic concurrency control to handle concurrent writes. This means it will likely work if you're writing to HDFS: Delta needs the underlying storage to support a "compare-and-swap" operation, or at least a way to fail if two writers try to overwrite each other's log entries, and HDFS supports that.
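To illustrate the idea, here is a minimal sketch of that "commit wins only if the log entry doesn't exist yet" primitive, using a local directory to stand in for HDFS. The `try_commit` function is hypothetical (not Delta's API); the point is that an atomic create-if-absent is what lets one of two racing writers fail cleanly instead of silently clobbering the other:

```python
import os
import tempfile

def try_commit(log_dir: str, version: int) -> bool:
    """Attempt to atomically claim log entry `version` via create-if-absent
    (a stand-in for the compare-and-swap Delta needs from the store)."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_CREAT | O_EXCL fails if the file already exists -- this is the
        # mutual exclusion HDFS offers and plain S3 historically did not.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, b'{"commitInfo": "..."}')  # placeholder log entry
        os.close(fd)
        return True
    except FileExistsError:
        return False  # another writer won; re-read the log and retry

log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 1)   # wins the race for version 1
second = try_commit(log_dir, 1)  # loses: version 1 already committed
print(first, second)  # True False
```

The losing writer doesn't corrupt anything; it simply observes that version 1 exists, re-reads the table state, and retries its commit as version 2. That retry loop is, roughly, what Delta's optimistic concurrency does for you.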
On S3, this is not supported:
Delta Lake has built-in support for S3. Delta Lake supports concurrent reads from multiple clusters, but concurrent writes to S3 must originate from a single Spark driver in order for Delta Lake to provide transactional guarantees. This is because S3 currently does not provide mutual exclusion, that is, there is no way to ensure that only one writer is able to create a file.
On the proprietary Delta Engine, Databricks does support multi-cluster writes to S3, using a proprietary service that coordinates those commit calls.
So to sum it up: on HDFS, concurrent merges are safe because one writer's commit will fail and can be retried; on open-source Delta over S3, you must route all writes through a single Spark driver, or you risk data inconsistency.
Upvotes: 1