Boris

Reputation: 886

Why is schemaEvolution not working in Databricks Auto Loader?

I'm reading CSV files and processing them daily so I can append the data to my bronze layer in Databricks using Auto Loader. The code looks like this:

    def run_autoloader(table_name, checkpoint_path, latest_file_location, new_columns):
        # Configure Auto Loader to ingest parquet data to a Delta table
        (spark.readStream
            .format("cloudFiles")
            #.schema(df_schema)
            .option("cloudFiles.format", "parquet")
            .option("cloudFiles.schemaLocation", checkpoint_path)
            .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
            .load(latest_file_location)
            .toDF(*new_columns)
            .select("*",
                    spark_col("_metadata.file_path").alias("source_file"),
                    current_timestamp().alias("processing_time"),
                    current_date().alias("processing_date"))
            .writeStream
            .option("checkpointLocation", checkpoint_path)
            .trigger(once=True)
            .option("mergeSchema", "true")
            .toTable(table_name))

Previously this was able to handle evolving schemas, but today, after the introduction of a new column in the input CSVs, I got the following error:

    requirement failed: The number of columns doesn't match.

I've read some posts suggesting either editing the schema manually or resetting it by deleting the schema checkpoint path. The first would require ongoing manual maintenance and the second would mean wiping all our bronze data, so for now neither is an option, especially if it's only a temporary fix.
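For context, by "editing the schema manually" I mean maintaining an explicit schema by hand and passing it through the .schema(df_schema) line that is currently commented out in the code above, roughly like this (the column names and types are just placeholders for our actual bronze columns):

    from pyspark.sql.types import StructType, StructField, StringType

    # Hand-maintained schema: every new source column would have to be added
    # here by hand before the daily run, which is exactly the maintenance
    # burden I'm trying to avoid.
    df_schema = StructType([
        StructField("existing_col_1", StringType(), True),
        StructField("existing_col_2", StringType(), True),
        StructField("newly_added_col", StringType(), True),
    ])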

I don't understand why this suddenly started happening, as this is specifically what Auto Loader was designed to do.

Any help would be much appreciated.

Upvotes: 0

Views: 56

Answers (1)

jhhhn12

Reputation: 26

Can you clarify in your question whether you are attempting to read Parquet or CSV? In the code snippet you provided, you are specifying the format as Parquet: .option("cloudFiles.format", "parquet"). If you are trying to read CSV files using Auto Loader, you should specify the format as csv.

  1. For CSV files, you need to set cloudFiles.inferColumnTypes to true if you want Auto Loader to infer the column datatypes. It defaults to false, as specified in the documentation linked below.
  2. Double-check that checkpoint_path contains both the inferred schema information and the checkpoint information (see the quick check after this list).
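A quick way to verify the second point (assuming you are running this in a Databricks notebook, where dbutils is available) is to list the schema location; Auto Loader keeps the inferred schemas it tracks in a _schemas folder inside it:

    # List the checkpoint/schema location. You should see the streaming
    # checkpoint folders plus a _schemas directory containing the versioned
    # schemas Auto Loader has inferred so far.
    display(dbutils.fs.ls(checkpoint_path))
    display(dbutils.fs.ls(f"{checkpoint_path}/_schemas"))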

Referencing this documentation:

    (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
        .option("cloudFiles.inferColumnTypes", "true")  # check docs for explanation
        .load(latest_file_location)
        .toDF(*new_columns)
        .select("*",
                spark_col("_metadata.file_path").alias("source_file"),
                current_timestamp().alias("processing_time"),
                current_date().alias("processing_date"))
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)
        .option("mergeSchema", "true")
        .toTable(table_name))
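Also worth noting: with cloudFiles.schemaEvolutionMode set to addNewColumns, the documented behavior is that the stream stops with an UnknownFieldException the first time it encounters new columns; the new columns are recorded in the schema location, and the next run picks them up. Since you are using trigger(once=True) on a daily schedule, one failed run right after a new column appears is expected, and the following run should continue with the evolved schema.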

Upvotes: 0
