Reputation: 3071
I recently encountered a scenario where I need to read input from an HDFS directory:
/user/project/jsonFile
and write the result back to the same directory:
/user/project/jsonFile
After reading jsonFile, multiple joins are performed and the result is written back to /user/project/jsonFile using:
result.write().mode(SaveMode.Overwrite).json("/user/project/jsonFile");
Below is the error I see:
[task-result-getter-0]o.a.s.s.TaskSetManager: Lost task 10.0 in stage 7.0 (TID 2508, hddev1db015dxc1.dev.oclc.org, executor 3): java.io.FileNotFoundException: File does not exist: /user/project/jsonFile
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:87)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:77)
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
Why is it throwing java.io.FileNotFoundException: File does not exist?
result is the dataset containing the output of the joins that is written back to HDFS. Once the result dataset is available, shouldn't Spark be able to write the data back to HDFS in the same input directory?
Or is it that some executors have finished their joins on the input and are ready to write the result back to HDFS, while other executors are still reading data from the same HDFS directory, which is now being overwritten, causing the FileNotFoundException? Is that what is happening?
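For context, the job boils down to roughly this (simplified; the second dataset and the join key are placeholders for the actual joins):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class JsonRewrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("json-rewrite").getOrCreate();

        // Lazy: this only plans the scan of /user/project/jsonFile, no data is read yet.
        Dataset<Row> input = spark.read().json("/user/project/jsonFile");

        // Placeholder for the actual joins (still lazy, nothing executes here).
        Dataset<Row> other = spark.read().json("/user/project/otherJson"); // hypothetical second input
        Dataset<Row> result = input.join(other, "id");                     // hypothetical join key

        // The write is the first action: executors start scanning the input files
        // while Overwrite is deleting that same directory.
        result.write().mode(SaveMode.Overwrite).json("/user/project/jsonFile");
    }
}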
Thanks for any help
Upvotes: 1
Views: 1072
Reputation: 1380
You are using Overwrite while reading from and writing to the same directory. One way around this is to use Append instead of Overwrite:
result.write().mode(SaveMode.Append).json("/user/project/jsonFile");
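Note that Append does not replace the old files: the directory will end up containing both the original input and the newly written result, so this only works if downstream consumers can tolerate that.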
Another workaround is to write your data to a temporary folder first, and then read from it as the source for your initial location:
1. Read from the source
2. Make your data transformations
3. Write the transformed data into tempLocation
4. Read from tempLocation
5. Write into the source
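In Java, that sequence looks roughly like this (a sketch; the tempLocation path and the transformations are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class TempLocationRewrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("temp-location-rewrite").getOrCreate();

        // 1. Read from the source.
        Dataset<Row> input = spark.read().json("/user/project/jsonFile");

        // 2. Make your data transformations (placeholder for the actual joins).
        Dataset<Row> result = input;

        // 3. Write the transformed data into tempLocation. This materializes the
        //    result while the source directory is still untouched.
        result.write().mode(SaveMode.Overwrite).json("/user/project/tempLocation");

        // 4. Read from tempLocation; this lineage no longer depends on the source.
        Dataset<Row> temp = spark.read().json("/user/project/tempLocation");

        // 5. Write into the source. Safe now, because no running task is reading it.
        temp.write().mode(SaveMode.Overwrite).json("/user/project/jsonFile");
    }
}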
Upvotes: 6