Marcel Mars

Reputation: 408

PySpark save file as parquet and read

My PySpark script saves created DataFrame to a directory:

df.write.save(full_path, format=file_format, mode=options['mode'])

If I read this file back in the same run, everything is fine:

return sqlContext.read.format(file_format).load(full_path)

However, when I try to read the file from this directory in another script run, I receive an error:

java.io.FileNotFoundException: File does not exist: /hadoop/log_files/some_data.json/part-00000-26c649cb-0c0f-421f-b04a-9d6a81bb6767.json

I understand that I can work around this by following Spark's tip:

It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
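For completeness, this is how I read those two suggestions in PySpark (a sketch only; the table name my_table is hypothetical and only applies if the data is registered as a table, and the second option just rebuilds the DataFrame from the files on disk):

# Suggestion 1: explicitly invalidate the cached metadata via SQL
sqlContext.sql("REFRESH TABLE my_table")

# Suggestion 2: recreate the DataFrame instead of reusing the old object
df = sqlContext.read.format(file_format).load(full_path)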

However, I want to know the reason for this failure, and what is the orthodox way to handle such a problem?

Upvotes: 1

Views: 709

Answers (1)

Joe9008

Reputation: 654

You are trying to manage two objects that refer to the same file, so the cache involving these objects is going to give you problems: both are targeting the same files. A simple solution is here:

https://stackoverflow.com/a/60328199/5647992
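For illustration, a minimal sketch of what that amounts to in the asker's code (this may differ from the linked answer; it assumes full_path and file_format are the values from the question, and uses Catalog.refreshByPath, available since Spark 2.2):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Drop any stale cached file listing for the path, then build a
# fresh DataFrame so no old object points at deleted part files.
spark.catalog.refreshByPath(full_path)
df = spark.read.format(file_format).load(full_path)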

Upvotes: 2
