Reputation:
I have the following script to read CDC data using Spark Structured Streaming before it is merged into the base Delta table.
streamDf = (spark
    .readStream
    .format("csv")
    .option("mergeSchema", "true")
    .option("header", "true")
    .option("path", CDCLoadPath)
    .load())
streamQuery = (streamDf
    .writeStream
    .format("delta")
    .outputMode("append")
    .foreachBatch(mergetoDelta)
    .option("checkpointLocation", f"{CheckpointLoc}/_checkpoint")
    .trigger(processingTime="20 seconds")
    .start())
Whenever I add a new column to the source table, the read stream does not pick up the schema change from the source files, even though the underlying data now contains the new column. But if I restart the script manually, it picks up the schema with the new column. Is there a way for the stream to pick up the change while it's running?
Upvotes: 1
Views: 4645
Reputation: 1369
Either you need to provide an object that supplies the schema of the input, or you will have to restart the stream so schema inference runs again.
Upvotes: 2