Reputation: 21615
Basically, in my program, tasks would be appending to an HDFS file. However, I don't want two tasks appending to the file at the same time. Is there a mechanism by which only one task at a time appends to an HDFS file, basically a mutex of some kind? I also need such a mutex when creating the file.
Upvotes: 1
Views: 422
Reputation: 29165
DataFrames in Spark 1.5 and above offer the ability to append to an existing DataFrame on HDFS. Internally, Spark uses the techniques described by @marios in the other answer.
For example (in Java):
dataframe.write().mode(SaveMode.Append)          // SaveMode comes from org.apache.spark.sql
         .format(FILE_FORMAT)
         .partitionBy("parameter1", "parameter2")
         .save(path);
If you check HDFS, you will see something like this (example of writing to "hello"; the two sets of part files, with different timestamps and UUIDs, come from the original write and the later append):
-rw-r--r-- 3 vagrant supergroup 0 2016-05-13 17:48 /home/hello/_SUCCESS
-rw-r--r-- 3 vagrant supergroup 281 2016-05-13 17:48 /home/hello/_common_metadata
-rw-r--r-- 3 vagrant supergroup 2041 2016-05-13 17:48 /home/hello/_metadata
-rw-r--r-- 3 vagrant supergroup 499 2016-05-13 17:46 /home/hello/part-r-00000-182e0b9b-a15d-47f9-8a3e-07739d6f2534.gz.parquet
-rw-r--r-- 3 vagrant supergroup 499 2016-05-13 17:48 /home/hello/part-r-00000-a8cf0223-69b3-4c2c-88f6-91252d99967c.gz.parquet
-rw-r--r-- 3 vagrant supergroup 499 2016-05-13 17:46 /home/hello/part-r-00001-182e0b9b-a15d-47f9-8a3e-07739d6f2534.gz.parquet
-rw-r--r-- 3 vagrant supergroup 499 2016-05-13 17:48 /home/hello/part-r-00001-a8cf0223-69b3-4c2c-88f6-91252d99967c.gz.parquet
-rw-r--r-- 3 vagrant supergroup 499 2016-05-13 17:46 /home/hello/part-r-00002-182e0b9b-a15d-47f9-8a3e-07739d6f2534.gz.parquet
-rw-r--r-- 3 vagrant supergroup 499 2016-05-13 17:48 /home/hello/part-r-00002-a8cf0223-69b3-4c2c-88f6-91252d99967c.gz.parquet
-rw-r--r-- 3 vagrant supergroup 499 2016-05-13 17:46 /home/hello/part-r-00003-182e0b9b-a15d-47f9-8a3e-07739d6f2534.gz.parquet
-rw-r--r-- 3 vagrant supergroup 499 2016-05-13 17:48 /home/hello/part-r-00003-a8cf0223-69b3-4c2c-88f6-91252d99967c.gz.parquet
Please see the different save-mode options and pick the one that suits your requirement here.
If you are using Spark 1.4, please have a look at the SaveMode doc.
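For reference, a minimal sketch of the four SaveMode values (Spark 1.4+), reusing the dataframe and path placeholders from the snippet above; the only import needed is org.apache.spark.sql.SaveMode:

dataframe.write().mode(SaveMode.Append).save(path);         // keep existing data, add new part files
dataframe.write().mode(SaveMode.Overwrite).save(path);      // replace whatever is already at path
dataframe.write().mode(SaveMode.ErrorIfExists).save(path);  // the default: fail if path already exists
dataframe.write().mode(SaveMode.Ignore).save(path);         // silently skip the write if path exists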
Upvotes: 1
Reputation: 8996
To the best of my knowledge, you cannot have more than one handle writing to the same HDFS file.
This is not a Spark limitation; it is simply how HDFS is designed. In HDFS, files are immutable: there is a single writer per file, and no appends once the file is closed. This is great for big data and Spark, since you always know that the same file will yield the same data.
The way to solve this in Hadoop is to have each writer write its own file and then run a final MapReduce job to coalesce them into one file (if a single file is something you really need).
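If you do want that single file, here is a hedged sketch of the merge step done with Spark's coalesce(1) instead of a dedicated MapReduce job (the folder names are hypothetical, and it uses the Spark 1.x Java DataFrame API):

// import org.apache.spark.sql.DataFrame;
// import org.apache.spark.sql.SQLContext;
// import org.apache.spark.sql.SaveMode;
SQLContext sqlContext = new SQLContext(sparkContext);               // sparkContext: your existing JavaSparkContext
DataFrame parts = sqlContext.read().parquet("/data/people_parts");  // hypothetical folder of per-writer files
parts.coalesce(1)                                                   // collapse everything into a single partition
     .write()
     .mode(SaveMode.Overwrite)
     .parquet("/data/people_merged");                               // output folder now holds a single part file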
Most of the time you can just work with these multiple files. The trick is to use a folder as your container, e.g., /a/b/people,
where the people folder contains many different files, each holding a different subset of "people". Spark has no problem reading multiple files and loading them into the same DataFrame or RDD.
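As a sketch of that read path (same Spark 1.x Java API; /a/b/people is just the example folder above):

// import org.apache.spark.sql.DataFrame;
// import org.apache.spark.sql.SQLContext;
SQLContext sqlContext = new SQLContext(sparkContext);         // sparkContext: your existing JavaSparkContext
DataFrame people = sqlContext.read().parquet("/a/b/people");  // picks up every part file in the folder
System.out.println(people.count());                           // operates on the union of all files, as one dataset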
Upvotes: 2