Reputation: 408
My PySpark script saves the DataFrame it creates to a directory:
df.write.save(full_path, format=file_format, mode=options['mode'])
If I read this file back in the same run, everything works fine:
return sqlContext.read.format(file_format).load(full_path)
However, when I try to read the file from this directory in another script run, I get an error:
java.io.FileNotFoundException: File does not exist: /hadoop/log_files/some_data.json/part-00000-26c649cb-0c0f-421f-b04a-9d6a81bb6767.json
I understand that I can work around this by following Spark's tip:
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
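For example, this is how I read that tip (just a sketch; I am working with a bare path rather than a registered table, and spark here is the SparkSession):

# Invalidate Spark's cached file listing for the path (available since Spark 2.2):
spark.catalog.refreshByPath(full_path)

# ...or, if the data were registered as a table, the SQL form from the tip
# ('some_table' is a placeholder name):
spark.sql("REFRESH TABLE some_table")

# ...or simply re-create the DataFrame from the path instead of reusing the old one:
df = spark.read.format(file_format).load(full_path)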
However, I want to know the reason for the failure, and what is the orthodox way to handle such a problem?
Upvotes: 1
Views: 709
Reputation: 654
You are trying to manage two DataFrame objects that point at the same files, so the cached file metadata behind those objects is going to give you problems: both of them are targeting the same path. A simple solution is here:
https://stackoverflow.com/a/60328199/5647992
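To sketch the general idea (this is only an illustration, not the exact code from the linked answer; staging_path and the checkpoint directory are placeholders):

df = spark.read.format(file_format).load(full_path)

# Option A: break the lineage before overwriting the path the DataFrame was
# read from, e.g. by checkpointing it (a checkpoint directory must be set first).
spark.sparkContext.setCheckpointDir('/tmp/spark-checkpoints')
df = df.checkpoint()
df.write.save(full_path, format=file_format, mode='overwrite')

# Option B: write the new data to a separate staging path and only then
# re-point readers at it, so no run ever reads part files that another
# run is deleting.
df.write.save(staging_path, format=file_format, mode='overwrite')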
Upvotes: 2