Mike

Reputation: 593

FileNotFoundException when trying to save DataFrame to parquet format, with 'overwrite' mode

I have this weird error. I have a routine that reads a dataframe if it exists (or creates one otherwise), modifies it, and then saves it again in the same target path in parquet format, with 'overwrite' mode.

In the first run, when there is no dataframe, I create one and save it. It generates 4 files in the output folder:

Then, in a second run, I read the dataframe, modify it, and when I try to overwrite it, it throws an exception that the *part-r-<.....>.snappy.parquet file does not exist*.

The output folder is empty when the exception occurs, but before the execution of df.write.parquet(path, 'overwrite') the folder contains this file.

I tried setting spark.sql.cacheMetadata to 'false', but it didn't help. spark.catalog.listTables() returns an empty list, so there is nothing to refresh.

For now, I simply delete the output folder's items and then write the dataframe. It works. But why does the original method with 'overwrite' mode fail?

Thanks.

Upvotes: 13

Views: 8013

Answers (2)

uh_big_mike_boi

Reputation: 3470

Another thing to do here is to cache the DataFrame right after you read it from HDFS:

df.cache()
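One caveat worth spelling out: cache() is lazy, so on its own it does not read anything. A minimal sketch of how the read could be pinned in memory before the overwrite, assuming `spark` is an active SparkSession and `path` is the parquet folder (the helper name `read_and_cache` is hypothetical):

```python
def read_and_cache(spark, path):
    """Read a parquet folder and force it into the cache (hypothetical helper).

    `spark` is assumed to be an active SparkSession.
    """
    # cache() only marks the DataFrame for caching; no data is read yet.
    df = spark.read.parquet(path).cache()
    # An action such as count() scans every partition and populates the cache.
    df.count()
    return df
```

Even then this is best effort rather than a guarantee: if cached partitions are evicted, Spark recomputes them from the original files, which the overwrite has already deleted.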

Upvotes: 1

RBanerjee

Reputation: 947

An RDD doesn't hold data the way a variable does. It's just a data structure that knows how to get the data (getPartitions) and what transformation to perform on that data (compute) when an action is called.

So what you are doing here is,

1st time => ... => Save to path A
2nd time onward => read from path A => do some transformation => Save to path A with overwrite mode

Now notice, your actual action is Save to path A. Until you call an action, Spark only builds the DAG, which records where to look for the data (path A), how to transform it, and where to save or show it once an action is called.

But because you selected overwrite mode, Spark adds a step to its execution plan that deletes the path first, and then it tries to read from that same path, which is already empty.
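The same ordering problem can be reproduced without Spark. The sketch below is only a plain-Python analogy, not Spark code: a generator stands in for the lazy read, and `os.remove` stands in for the delete that overwrite mode performs. All function names are made up for illustration.

```python
import os
import tempfile


def lazy_read(path):
    """Generator: the file is not opened until the first row is requested."""
    with open(path) as handle:
        for line in handle:
            yield line.rstrip("\n")


def overwrite_from_lazy_source(path):
    """Mimic 'read path A -> transform -> overwrite path A'."""
    # Building the pipeline reads nothing; it is only a plan.
    rows = (row.upper() for row in lazy_read(path))
    # The "overwrite" step clears the target before any data has been read.
    os.remove(path)
    try:
        # Only now does the read actually run, and the source is gone.
        return list(rows)
    except FileNotFoundError:
        return "FileNotFoundError"


def demo():
    """Create a small file, then run the lazy read/overwrite sequence on it."""
    path = os.path.join(tempfile.mkdtemp(), "data.txt")
    with open(path, "w") as handle:
        handle.write("a\nb\n")
    return overwrite_from_lazy_source(path)
```

Calling `demo()` returns the string "FileNotFoundError": the read is deferred until after the delete, which is the same sequence described above.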

So, as a workaround, you can either save each run to a different folder (for example, on a partition basis), or save to two paths: a tmp path first and then the destination.
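A minimal sketch of the tmp-path variant in PySpark, assuming `spark` is an active SparkSession, `df` is the transformed DataFrame that was read from the target, and both paths are placeholders you choose (the helper name `overwrite_via_tmp` is hypothetical):

```python
def overwrite_via_tmp(spark, df, target_path, tmp_path):
    """Overwrite target_path with df even though df was read from target_path.

    Hypothetical helper; `spark` is assumed to be an active SparkSession.
    """
    # 1. Materialize the result somewhere other than its own source.
    df.write.parquet(tmp_path, mode="overwrite")
    # 2. Re-read the staged copy so the new plan no longer depends on target_path.
    staged = spark.read.parquet(tmp_path)
    # 3. Now the target can be deleted and rewritten safely.
    staged.write.parquet(target_path, mode="overwrite")
    return staged
```

The cost is that the data is written twice, and the tmp folder has to be cleaned up separately once the final write succeeds.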

Upvotes: 9
