Reputation: 478
I am trying to save the contents of dataset to csv using
df.coalesce(1)
.write()
.format("csv")
.mode("append")
.save(PATH+"/trial.csv");
My aim is to keep appending the results of the dataset to the trial.csv file. However, it creates a folder called trial.csv and writes the CSV inside it. When I run it again, it creates another CSV file inside the trial.csv folder. But I just want it to keep appending to one CSV file, which I am unable to do.
I know this can be done with a script from outside the program, but can we achieve it from inside our code? I am using Java.
Upvotes: 4
Views: 1062
Reputation: 1118
Appending to an existing file is hard to do for a distributed, multi-threaded application: it would turn a parallelised operation into a sequential task. The way this is usually handled is that each Spark task persists its own file under the specified path, so that path becomes a folder containing all the files. To read them back, you read the whole folder.
If your data is not big and you really need a single file, try repartitioning (or coalescing) to 1 partition. This makes a single task write the new data, but it will still never append to previously written files.
You have to be careful, but you can do something like this:
df.union(spark.read().csv(PATH + "/trial.csv"))
.coalesce(1)
.write()
.format("csv")
.mode("append")
.save(PATH + "/trial_auxiliar.csv");
Then move it to the previous folder, either with Spark or with a Hadoop move command. Never write to and read from the same folder in the same job, and keep in mind that this approach does not guarantee the order of the data.
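For the move step, if the path is on the local filesystem you can promote the single "part-*" file that Spark wrote inside the output folder to a plain CSV file with java.nio. This is only a sketch under that assumption: the class and file names here are made up for illustration, and on HDFS you would use org.apache.hadoop.fs.FileSystem.rename instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class PromotePartFile {
    // Finds the single "part-*" file Spark wrote inside outputDir and moves
    // it to target, replacing any previous file. Local filesystem only;
    // for HDFS use the Hadoop FileSystem API instead of java.nio.
    static void promote(Path outputDir, Path target) throws IOException {
        try (Stream<Path> files = Files.list(outputDir)) {
            Path part = files
                .filter(p -> p.getFileName().toString().startsWith("part-"))
                .findFirst()
                .orElseThrow(() -> new IOException("no part file in " + outputDir));
            Files.move(part, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate the folder Spark leaves behind: one part file plus _SUCCESS.
        Path dir = Files.createTempDirectory("trial_auxiliar.csv");
        Files.writeString(dir.resolve("part-00000-abc.csv"), "a,b\n1,2\n");
        Files.writeString(dir.resolve("_SUCCESS"), "");

        Path target = dir.resolveSibling(dir.getFileName() + "-final.csv");
        promote(dir, target);

        System.out.println(Files.exists(target));
        System.out.println(Files.readString(target).startsWith("a,b"));
    }
}
```

Note that this only collapses the folder into one file after the job finishes; it does not change the fact that each append run must write to a fresh folder first.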
Upvotes: 1