Anshul

Reputation: 478

Saving spark dataset to an existing csv file

I am trying to save the contents of dataset to csv using

df.coalesce(1)
  .write()
  .format("csv")
  .mode("append")
  .save(PATH+"/trial.csv");

My aim is to keep appending the results of the dataset to the trial.csv file. However, it creates a folder called trial.csv and creates a csv file inside of that. When I run it again, it creates another csv file inside the trial.csv folder. But I just want it to keep appending to one csv file, which I am unable to do.

I know we could do this with a script outside of the program, but can we achieve it from inside our code? I am using Java.

Upvotes: 4

Views: 1062

Answers (1)

Alfilercio

Reputation: 1118

Appending to an existing file is hard for a distributed, multi-threaded application; it would turn a parallelised operation into a sequential task. The way Spark usually handles this is to persist one file per task under the specified path, so that path becomes a folder containing all of those files. To read them back, you can read the complete folder.
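For example (a minimal sketch, assuming the same spark session and PATH as in the question), reading the folder back yields a single dataset built from all the part files inside it:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Spark treats the trial.csv folder as one source: it reads every
// part file inside it and returns their combined rows.
Dataset<Row> all = spark.read()
    .format("csv")
    .load(PATH + "/trial.csv");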

If your data is not big and you really need a single file, try repartitioning (or coalescing) to 1: this makes a single task write the new data, but it will still never append to previously written files.

You have to be careful, but you can do something like this:

df.union(spark.read().csv(PATH + "/trial.csv")) // merge the new rows with the existing data
  .coalesce(1)                                  // force a single output file
  .write()
  .format("csv")
  .mode("append")
  .save(PATH + "/trial_auxiliar.csv");

Then move it to the previous folder, either with Spark or with a Hadoop move command, as sketched below. Never write to and read from the same folder in the same job, and keep in mind that this won't guarantee the order of the data.
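For the move step, here is a minimal sketch using the Hadoop FileSystem API (assuming the same spark session and PATH as above; the folder names are the ones used in this answer):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
Path target = new Path(PATH + "/trial.csv");
Path auxiliar = new Path(PATH + "/trial_auxiliar.csv");

fs.delete(target, true);     // drop the old folder (recursive delete)
fs.rename(auxiliar, target); // move the fresh output into its place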

Upvotes: 1
