user16344431


Structured Streaming spark.sql.streaming.schemaInference not handling schema changes

SparkSession sparkSession = SparkSession.builder()
        .config("spark.sql.streaming.schemaInference", true)
        .getOrCreate();

Dataset<Row> dataset = sparkSession.readStream().parquet("file:/files-to-process");

StreamingQuery streamingQuery = dataset.writeStream()
        .option("checkpointLocation", "file:/checkpoint-location")
        .outputMode(OutputMode.Append())
        .start("file:/save-parquet-files");

streamingQuery.awaitTermination();

After the streaming query has started, if there is a schema change in new parquet files under the files-to-process directory, Structured Streaming does not write the new columns. Is it possible to handle these schema changes in Structured Streaming?
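For context: with `spark.sql.streaming.schemaInference` enabled, the file source infers the schema only once, when the query starts, so columns added by later files are not picked up. A common workaround is to declare an explicit schema up front that already contains the columns you expect future files to add; Parquet readers return null for columns absent from older files. This is a sketch, not a drop-in fix, and the column names here are hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ExplicitSchemaStream {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().getOrCreate();

        // Declare every column up front, including ones that will only
        // appear in newer files; missing columns read back as null.
        StructType schema = new StructType()
                .add("id", DataTypes.LongType)           // hypothetical column
                .add("name", DataTypes.StringType)       // hypothetical column
                .add("added_later", DataTypes.StringType); // column expected in future files

        Dataset<Row> dataset = spark.readStream()
                .schema(schema)                          // explicit schema instead of inference
                .parquet("file:/files-to-process");

        StreamingQuery query = dataset.writeStream()
                .option("checkpointLocation", "file:/checkpoint-location")
                .outputMode(OutputMode.Append())
                .start("file:/save-parquet-files");

        query.awaitTermination();
    }
}
```

If the schema changes in a way you did not anticipate, the query still has to be stopped and restarted with an updated schema (and, depending on the change, a new checkpoint location).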

Upvotes: 0

Views: 473

Answers (1)

Vindhya G

Reputation: 1369

Perhaps this section of the Spark documentation could help: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#schema-inference-and-partition-of-streaming-dataframesdatasets

Partition discovery does occur when subdirectories that are named /key=value/ are present and listing will automatically recurse into these directories. If these columns appear in the user-provided schema, they will be filled in by Spark based on the path of the file being read. The directories that make up the partitioning scheme must be present when the query starts and must remain static. For example, it is okay to add /data/year=2016/ when /data/year=2015/ was present, but it is invalid to change the partitioning column (i.e. by creating the directory /data/date=2016-04-17/)
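To illustrate the quoted passage, a minimal sketch (the paths, column names, and types here are hypothetical): with files laid out under /data/year=2015/ and /data/year=2016/, the partition column `year` can be declared in the user-provided schema and Spark fills it in from the file path rather than from the file contents:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class PartitionedStream {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().getOrCreate();

        StructType schema = new StructType()
                .add("value", DataTypes.StringType)   // column stored inside the parquet files
                .add("year", DataTypes.IntegerType);  // partition column, filled from /year=.../

        // Listing recurses into /data/year=2015/, /data/year=2016/, etc.
        // New year=... directories may appear later, but the partitioning
        // column itself must not change while the query runs.
        Dataset<Row> df = spark.readStream()
                .schema(schema)
                .parquet("file:/data");
    }
}
```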

Upvotes: 0
