Andrea

Reputation: 4473

Save avro multiple times in the same path using Spark

I'm using the Databricks spark-avro package (for Spark 1.5.2) to save a DataFrame fetched from Elasticsearch as Avro on HDFS. After doing some processing on my DataFrame, I save the data to HDFS using the following command:

df.write.avro("my/path/to/data")

Everything works fine and I can read my data using Hive. The biggest issue I'm facing at the moment is that I can't write data twice into the same path (for example, running my script twice with "my/path/to/data" as the output). Since I need to add data incrementally, how can I solve this problem? I have thought of some workarounds, but I wonder if there is a way to actually solve this problem in Spark.

Upvotes: 1

Views: 1560

Answers (2)

koiralo

Reputation: 23099

If your data is not updated frequently, append mode works fine:

df.write.mode(SaveMode.Append).avro("outputpath")
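Note that this snippet assumes the spark-avro implicits and SaveMode are in scope; a minimal self-contained version of the same call would be:

import com.databricks.spark.avro._   // adds .avro(...) to DataFrameWriter
import org.apache.spark.sql.SaveMode

df.write.mode(SaveMode.Append).avro("outputpath")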

If you are updating frequently, however, this creates a large number of files (some of them possibly empty). To overcome this issue you need to (as sketched after this list):

  • Read the previous data and append the new data to it
  • Store the result in a temporary directory
  • Delete the original directory
  • Rename the temporary directory to the original path
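
A minimal sketch of those four steps, assuming the spark-avro 2.x package on Spark 1.5 (hence unionAll rather than union) and the Hadoop FileSystem API; appendAvro and both path arguments are illustrative names:

import com.databricks.spark.avro._
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

// Illustrative helper: merge a new batch into dataPath via a temporary directory
def appendAvro(df: DataFrame, dataPath: String, tmpPath: String): Unit = {
  val sqlContext = df.sqlContext
  val fs = FileSystem.get(sqlContext.sparkContext.hadoopConfiguration)

  // 1. Read the previous data (if any) and append the new batch to it
  val merged =
    if (fs.exists(new Path(dataPath))) sqlContext.read.avro(dataPath).unionAll(df)
    else df

  // 2. Store the merged result in a temporary directory
  merged.write.avro(tmpPath)

  // 3. Delete the original directory
  fs.delete(new Path(dataPath), true)

  // 4. Rename the temporary directory to the original path
  fs.rename(new Path(tmpPath), new Path(dataPath))
}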

Hope this helps

Upvotes: 3

user8030881

Reputation: 46

You should provide an appropriate save mode. Use overwrite if you want to replace the existing data:

df.write.mode("overwrite").avro("my/path/to/data")

and append if you want to add to it:

df.write.mode("append").avro("my/path/to/data")

Upvotes: 3
