Reputation: 4473
I'm using Databricks spark-avro (for Spark 1.5.2) to save a DataFrame fetched from ElasticSearch as Avro on HDFS. After doing some processing on the DataFrame, I save the data to HDFS with the following command:
df.write.avro("my/path/to/data")
Everything works fine and I can read the data back using Hive. The biggest issue I'm facing at the moment is that I can't write data twice to the same path (for example, by running my script twice with "my/path/to/data" as the output). Since I need to add data incrementally, how can I solve this problem? I have thought of some workarounds, but I wonder whether there is a way to actually solve this in Spark.
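For reference, here is a minimal sketch of the failing flow (assuming the Databricks spark-avro import and that df is the processed DataFrame):

import com.databricks.spark.avro._

// First run: succeeds and creates the directory on HDFS.
df.write.avro("my/path/to/data")

// Second run: fails, because the default save mode is ErrorIfExists,
// so Spark throws an AnalysisException when the path already exists.
df.write.avro("my/path/to/data")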
Upvotes: 1
Views: 1560
Reputation: 23099
If your data is not updated frequently, Append mode works fine:
df.write.mode(SaveMode.Append).avro("outputpath")
If you are appending frequently, however, this creates a large number of files (some of which may even be empty). To overcome this, you need to reduce the number of output files per run, for example by coalescing the DataFrame before writing, as in the sketch below.
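A sketch of that approach (coalesce(1) is just an example; pick a partition count that matches your data volume):

import com.databricks.spark.avro._
import org.apache.spark.sql.SaveMode

// Coalescing the DataFrame into a single partition yields one
// output file per run instead of one file per task.
df.coalesce(1)
  .write
  .mode(SaveMode.Append)
  .avro("outputpath")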
Hope this helps
Upvotes: 3
Reputation: 46
You should provide an appropriate save mode. Use overwrite if you want to replace the existing data:
df.write.mode("overwrite").avro("my/path/to/data")
Use append if you want to add to it:
df.write.mode("append").avro("my/path/to/data")
Upvotes: 3