FileNotFoundException in Azure Synapse when trying to delete a Parquet file and replace it with a dataframe created from the original file

Question

I am trying to delete an existing Parquet file and replace it with the contents of a dataframe that was built by reading that same file before it was deleted. This is in Azure Synapse using PySpark.

So I created the Parquet file from a dataframe and put it in the path:

full_file_path

I am trying to update this Parquet file. From what I have read, you can't edit a Parquet file in place, so as a workaround I am reading the file into a new dataframe:

df = spark.read.parquet(full_file_path)

I then create a new dataframe with the update:

df.createOrReplaceTempView("temp_table")
df_variance = spark.sql("""SELECT * FROM temp_table WHERE ....""")

and the df_variance dataframe is created.

I then delete the original file with:

mssparkutils.fs.rm(full_file_path, True)

and the original file is deleted. But when I do any operation with the df_variance dataframe, like df_variance.count(), I get a FileNotFoundException error. What I am really trying to do is:

df_variance.write.parquet(full_file_path)

and that also fails with a FileNotFoundException. In fact, any operation I try on the df_variance dataframe produces this error. So I suspect that, because the original full_file_path has been deleted, the df_variance dataframe still holds some sort of reference to the (now deleted) file path. Please help. Thanks.

David Browne - Microsoft · Accepted Answer

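Spark dataframes aren't collections of rows. Spark dataframes use "deferred execution": reading the file and applying transformations only builds up a query. Only when you call an action such as

df_variance.write.parquet(full_file_path)

(or df_variance.count()) is a Spark job run that reads from the source, performs your transformations, and writes to the destination. Because you delete the source file before that job runs, the read fails with a FileNotFoundException.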

A Spark dataframe is really just a query that you can compose with other expressions before finally running it.
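One way to stay on Parquet, sketched here under some assumptions, is to run the query and write its result to a scratch location before deleting the original, then copy the result back. The helper name rewrite_parquet_via_temp, the tmp_path argument, and the transform callable are hypothetical names introduced for illustration; fs stands for any object with an rm(path, recurse) method, such as mssparkutils.fs from the question.

```python
def rewrite_parquet_via_temp(spark, fs, full_file_path, tmp_path, transform):
    """Replace the Parquet data at full_file_path with transform(original data).

    spark:     an active SparkSession
    fs:        object exposing rm(path, recurse), e.g. mssparkutils.fs (assumed)
    tmp_path:  hypothetical scratch location, must differ from full_file_path
    transform: function taking a dataframe and returning the updated dataframe
    """
    df = spark.read.parquet(full_file_path)
    df_variance = transform(df)  # still lazy: nothing has been read yet

    # Action: the Spark job runs now, while the source file still exists.
    df_variance.write.mode("overwrite").parquet(tmp_path)

    # The result is on disk, so the original can be removed.
    fs.rm(full_file_path, True)

    # Copy the materialized result back; this read depends only on tmp_path.
    spark.read.parquet(tmp_path).write.parquet(full_file_path)

    # Clean up the scratch copy.
    fs.rm(tmp_path, True)
```

In a Synapse notebook this might be called as rewrite_parquet_via_temp(spark, mssparkutils.fs, full_file_path, tmp_path, transform), where transform wraps the temp-view query from the question. The extra write costs a second pass over the data, but every read in the sequence targets a path that exists at the time its job runs.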

You might want to move on from Parquet to Delta Lake: https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-what-is-delta-lake
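As a rough sketch of what the Delta suggestion could look like, assuming the data has already been written as a Delta table at a location called delta_path (a hypothetical name, as are the helper and its transform argument): a Delta overwrite commits a new table version rather than deleting the files the running job is reading, so reading and overwriting the same path should be possible without a manual delete.

```python
def overwrite_delta_in_place(spark, delta_path, transform):
    """Read a Delta table, apply transform, and overwrite the same table.

    spark:      an active SparkSession with Delta Lake available
    delta_path: hypothetical location of an existing Delta table
    transform:  function taking a dataframe and returning the updated dataframe
    """
    df = spark.read.format("delta").load(delta_path)

    # No explicit delete: the overwrite is recorded as a new table version.
    transform(df).write.format("delta").mode("overwrite").save(delta_path)
```

Older versions of the table remain readable until they are cleaned up, which is what makes this pattern safe where the plain Parquet delete-then-write sequence is not.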
