Reputation: 43
I am trying to delete an existing Parquet file and replace it with data in a dataframe that read the data in the original Parquet file before deleting it. This is in Azure Synapse using PySpark.
So I created the Parquet file from a dataframe and put it in the path:
full_file_path
I am trying to update this Parquet file. From what I am reading, you can't edit a Parquet file so as a workaround, I am reading the file into a new dataframe:
df = spark.read.parquet(full_file_path)
I then create a new dataframe with the update:
df.createOrReplaceTempView("temp_table")
df_variance = spark.sql("""SELECT * FROM temp_table WHERE ....""")
and the df_variance dataframe is created.
I then delete the original file with:
mssparkutils.fs.rm(full_file_path, True)
and the original file is deleted. But when I do any operation with the df_variance dataframe, like df_variance.count(), I get a FileNotFoundException error. What I am really trying to do is:
df_variance.write.parquet(full_file_path)
and that is also a FileNotFoundException error. But I am finding that any operation I try to do with the df_variance dataframe is producing this error. So I am thinking it might have to do with the fact that the original full_file_path has been deleted and that the df_variance dataframe maintains some sort of reference to the (now deleted) file path, or something like that. Please help. Thanks.
Upvotes: 0
Views: 226
Reputation: 88971
Spark dataframes aren't collections of rows. Spark dataframes use "deferred execution". Only when you call
df_variance.write
is a spark job run that reads from the source, performs your transformations, and writes to the destination.
A Spark dataframe is really just a query that you can compose with other expressions before finally running it.
You might want to move on from parquet to delta. https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-what-is-delta-lake
Upvotes: 1