Reputation: 3071
I recently encountered a scenario where I need to read input from an HDFS directory:
/user/project/jsonFile
and write the result back to the same directory:
/user/project/jsonFile
After reading jsonFile, multiple joins are performed and the result is written back to /user/project/jsonFile using:
result.write().mode(SaveMode.Overwrite).json("/user/project/jsonFile");
Below is the error I see:
[task-result-getter-0]o.a.s.s.TaskSetManager: Lost task 10.0 in stage 7.0 (TID 2508, hddev1db015dxc1.dev.oclc.org, executor 3): java.io.FileNotFoundException: File does not exist: /user/project/jsonFile
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:87)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:77)
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
Why is it throwing java.io.FileNotFoundException: File does not exist?
result is the dataset containing the output of the joins that is written back to HDFS. Once the result dataset is available, shouldn't Spark be able to write the data back to HDFS in the same input directory?
Or is it that some executors have finished their joins on the input and are ready to write the result back to HDFS, while other executors are still reading data from the same HDFS directory, which is now being overwritten, causing the FileNotFoundException? Is that what is happening?
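For context, the job boils down to roughly this (simplified; the second dataset and the join key are placeholders for the actual joins):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class JsonRewrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("json-rewrite").getOrCreate();

        // Lazy: this only plans the scan of /user/project/jsonFile, no data is read yet.
        Dataset<Row> input = spark.read().json("/user/project/jsonFile");

        // Placeholder for the actual joins (still lazy, nothing executes here).
        Dataset<Row> other = spark.read().json("/user/project/otherJson"); // hypothetical second input
        Dataset<Row> result = input.join(other, "id");                     // hypothetical join key

        // The write is the first action: executors start scanning the input files
        // while Overwrite is deleting that same directory.
        result.write().mode(SaveMode.Overwrite).json("/user/project/jsonFile");
    }
}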
Thanks for any help
Upvotes: 1
Views: 1072
Reputation: 1380
You are using Overwrite while reading from and writing to the same directory. One way around this is to use Append instead of Overwrite:
result.write().mode(SaveMode.Append).json("/user/project/jsonFile");
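Note that Append does not replace the old files: the directory will end up containing both the original input and the newly written result, so this only works if downstream consumers can tolerate that.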
Another workaround is to write your data to a temporary folder first, and then read from it as the source for your initial location:
1. Read from the source
2. Make your data transformations
3. Write the transformed data into tempLocation
4. Read from tempLocation
5. Write into the source
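In Java, that sequence looks roughly like this (a sketch; the tempLocation path and the transformations are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class TempLocationRewrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("temp-location-rewrite").getOrCreate();

        // 1. Read from the source.
        Dataset<Row> input = spark.read().json("/user/project/jsonFile");

        // 2. Make your data transformations (placeholder for the actual joins).
        Dataset<Row> result = input;

        // 3. Write the transformed data into tempLocation. This materializes the
        //    result while the source directory is still untouched.
        result.write().mode(SaveMode.Overwrite).json("/user/project/tempLocation");

        // 4. Read from tempLocation; this lineage no longer depends on the source.
        Dataset<Row> temp = spark.read().json("/user/project/tempLocation");

        // 5. Write into the source. Safe now, because no running task is reading it.
        temp.write().mode(SaveMode.Overwrite).json("/user/project/jsonFile");
    }
}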
Upvotes: 6