Reputation: 117
I have a number of tables (with varying degrees of difference in their schemas, but with a common set of fields) that I would like to union and load incrementally from Bronze to Silver. So the goal is to go from multiple tables to a single table using DLT.
Example:
X_Events
Y_Events
.... N_Events
To: All_Events
I am using a for loop to go through all the databases and tables, performing a readStream on each, followed by a unionByName.
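Roughly, the approach looks like the sketch below (table names and structure are illustrative, not the actual pipeline; `spark` is the session provided by the DLT runtime). The key point is that the union bakes a fixed number of streaming sources into one query, and therefore into one checkpoint:

```python
import dlt
from functools import reduce

# Hypothetical list of bronze tables discovered at pipeline start.
# If this list changes between runs, the number of streaming sources
# recorded in the checkpoint no longer matches the query.
source_tables = ["bronze.x_events", "bronze.y_events"]  # ... bronze.n_events

@dlt.table(name="all_events")
def all_events():
    # One readStream per source table ...
    streams = [spark.readStream.table(t) for t in source_tables]
    # ... combined into a single streaming query; this union is what
    # ties the checkpoint to a fixed set of sources.
    return reduce(
        lambda a, b: a.unionByName(b, allowMissingColumns=True), streams
    )
```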
However, if an additional table is added or modified dynamically and needs to be processed in the next run, I get a checkpoint error:
There are [8] sources in the checkpoint offsets and now there are
[6] sources requested by the query. Cannot continue.
Is there a way to address this dynamically?
Should I build my own incremental logic? Is there a better way to achieve this?
Upvotes: 2
Views: 1708
Reputation: 87279
This is a documented limitation of Spark Structured Streaming:
Changes in the number or type (i.e. different source) of input sources: This is not allowed.
But from your description it looks like you may not need unionByName at all - you can have N independent streams that write into the same table. If you only append to the table, concurrent appends won't lead to write conflicts (each stream is independent!):
bronze 1 \
bronze 2 \
bronze 3 >--> append to a Silver table
.......... /
bronze N /
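In DLT, the one-flow-per-source pattern can be sketched with `dlt.append_flow` (assuming a runtime where it is available; table names are illustrative). Each flow keeps its own checkpoint, so adding a new bronze table later just adds a new flow instead of changing an existing query's source count:

```python
import dlt

# Hypothetical source list; each entry becomes an independent flow
# with its own checkpoint.
source_tables = ["bronze.x_events", "bronze.y_events"]  # ... bronze.n_events

# Target table that all flows append into.
dlt.create_streaming_table("all_events")

for t in source_tables:
    @dlt.append_flow(target="all_events", name=f"append_{t.replace('.', '_')}")
    def append_source(table_name=t):  # default arg binds the loop variable
        return spark.readStream.table(table_name)
```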
If you need to do merges into the target table, or make other changes to it, you can still follow the same approach: append to an intermediate table, then stream from that table and merge/update into the target:
bronze 1 \
bronze 2 \
bronze 3 >--> append to an intermediate table --> merge into Silver
.......... /
bronze N /
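The second stage (stream from the intermediate table and merge into Silver) can be written with `foreachBatch` and a Delta merge. This is a sketch: the table names, the checkpoint path, and the `event_id` merge key are all placeholders for your own schema:

```python
from delta.tables import DeltaTable

def merge_batch(batch_df, batch_id):
    # Upsert each micro-batch into the Silver table.
    # "event_id" is a hypothetical unique key - replace with your own.
    silver = DeltaTable.forName(spark, "silver.all_events")
    (silver.alias("t")
        .merge(batch_df.alias("s"), "t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.table("intermediate.all_events_append")
    .writeStream
    .foreachBatch(merge_batch)
    .option("checkpointLocation", "/checkpoints/all_events_merge")
    .start())
```

The intermediate table is append-only, so the N independent writers never conflict; only the single merge stream touches the Silver table.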
Upvotes: 4