Reputation: 117
I have a number of tables (with varying degrees of difference in their schemas, but with a common set of fields) that I would like to union and load incrementally from Bronze to Silver. So the goal is to go from multiple tables to a single table using DLT.
Example:
X_Events
Y_Events
.... N_Events
To: All_Events
I am using a for loop to go through all the databases and tables, performing a readStream on each, followed by a unionByName.
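Roughly, the approach looks like the sketch below (table names and structure are illustrative, not the actual pipeline; `spark` is the session provided by the DLT runtime). The key point is that the union bakes a fixed number of streaming sources into one query, and therefore into one checkpoint:

```python
import dlt
from functools import reduce

# Hypothetical list of bronze tables discovered at pipeline start.
# If this list changes between runs, the number of streaming sources
# recorded in the checkpoint no longer matches the query.
source_tables = ["bronze.x_events", "bronze.y_events"]  # ... bronze.n_events

@dlt.table(name="all_events")
def all_events():
    # One readStream per source table ...
    streams = [spark.readStream.table(t) for t in source_tables]
    # ... combined into a single streaming query; this union is what
    # ties the checkpoint to a fixed set of sources.
    return reduce(
        lambda a, b: a.unionByName(b, allowMissingColumns=True), streams
    )
```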
However, if an additional table is added or modified dynamically and needs to be processed in the next run, I get a checkpoint error:
There are [8] sources in the checkpoint offsets and now there are
[6] sources requested by the query. Cannot continue.
Is there a way to address this dynamically?
Should I build my own incremental logic? Is there a better way to achieve this?
Upvotes: 2
Views: 1708
Reputation: 87279
This is a documented limitation of Spark Structured Streaming:
Changes in the number or type (i.e. different source) of input sources: This is not allowed.
But from your description it looks like you may not need unionByName at all - you can have N independent streams that write into the same table. If you only append to the table, concurrent appends won't lead to write conflicts (each stream is independent!):
bronze 1 \
bronze 2 \
bronze 3 >--> append to a Silver table
.......... /
bronze N /
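In DLT, the one-flow-per-source pattern can be sketched with `dlt.append_flow` (assuming a runtime where it is available; table names are illustrative). Each flow keeps its own checkpoint, so adding a new bronze table later just adds a new flow instead of changing an existing query's source count:

```python
import dlt

# Hypothetical source list; each entry becomes an independent flow
# with its own checkpoint.
source_tables = ["bronze.x_events", "bronze.y_events"]  # ... bronze.n_events

# Target table that all flows append into.
dlt.create_streaming_table("all_events")

for t in source_tables:
    @dlt.append_flow(target="all_events", name=f"append_{t.replace('.', '_')}")
    def append_source(table_name=t):  # default arg binds the loop variable
        return spark.readStream.table(table_name)
```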
If you need to do merges into the target table, or make other changes to it, you can still follow the same approach: append to an intermediate table, then stream from that table and merge/update into the target:
bronze 1 \
bronze 2 \
bronze 3 >--> append to an intermediate table --> merge into Silver
.......... /
bronze N /
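The second stage (stream from the intermediate table and merge into Silver) can be written with `foreachBatch` and a Delta merge. This is a sketch: the table names, the checkpoint path, and the `event_id` merge key are all placeholders for your own schema:

```python
from delta.tables import DeltaTable

def merge_batch(batch_df, batch_id):
    # Upsert each micro-batch into the Silver table.
    # "event_id" is a hypothetical unique key - replace with your own.
    silver = DeltaTable.forName(spark, "silver.all_events")
    (silver.alias("t")
        .merge(batch_df.alias("s"), "t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.table("intermediate.all_events_append")
    .writeStream
    .foreachBatch(merge_batch)
    .option("checkpointLocation", "/checkpoints/all_events_merge")
    .start())
```

The intermediate table is append-only, so the N independent writers never conflict; only the single merge stream touches the Silver table.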
Upvotes: 4