Andrea

Reputation: 4473

Save avro multiple times in the same path using Spark

I'm using the Databricks spark-avro package (for Spark 1.5.2) to save a DataFrame fetched from Elasticsearch as Avro on HDFS. After doing some processing on my DataFrame, I save the data to HDFS using the following command:

df.write.avro("my/path/to/data")

Everything works fine and I can read my data using Hive. The biggest issue I'm facing at the moment is that I can't write data twice into the same path (for example, running my script twice with "my/path/to/data" as the output). Since I need to add data incrementally, how can I solve this problem? I have thought of some workarounds, but I wonder if there is a way to actually solve this problem in Spark.

Upvotes: 1

Views: 1560

Answers (2)

koiralo

Reputation: 23099

If your data is not updated frequently, append mode works fine:

df.write.mode(SaveMode.Append).avro("outputpath")
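Note that this snippet assumes the spark-avro implicits and SaveMode are in scope; a minimal self-contained version of the same call would be:

import com.databricks.spark.avro._   // adds .avro(...) to DataFrameWriter
import org.apache.spark.sql.SaveMode

df.write.mode(SaveMode.Append).avro("outputpath")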

If you are updating frequently, however, this creates a large number of files (some of them possibly empty). To overcome this issue you need to (as sketched after this list):

  • Read the previous data and append the new data to it
  • Store the result in a temporary directory
  • Delete the original directory
  • Rename the temporary directory to the original path
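
A minimal sketch of those four steps, assuming the spark-avro 2.x package on Spark 1.5 (hence unionAll rather than union) and the Hadoop FileSystem API; appendAvro and both path arguments are illustrative names:

import com.databricks.spark.avro._
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

// Illustrative helper: merge a new batch into dataPath via a temporary directory
def appendAvro(df: DataFrame, dataPath: String, tmpPath: String): Unit = {
  val sqlContext = df.sqlContext
  val fs = FileSystem.get(sqlContext.sparkContext.hadoopConfiguration)

  // 1. Read the previous data (if any) and append the new batch to it
  val merged =
    if (fs.exists(new Path(dataPath))) sqlContext.read.avro(dataPath).unionAll(df)
    else df

  // 2. Store the merged result in a temporary directory
  merged.write.avro(tmpPath)

  // 3. Delete the original directory
  fs.delete(new Path(dataPath), true)

  // 4. Rename the temporary directory to the original path
  fs.rename(new Path(tmpPath), new Path(dataPath))
}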

Hope this helps

Upvotes: 3

user8030881

Reputation: 46

You should provide an appropriate save mode. Use overwrite if you want to replace the existing data:

df.write.mode("overwrite").avro("my/path/to/data")

and append if you want to add to it:

df.write.mode("append").avro("my/path/to/data")

Upvotes: 3
