Reputation: 11445
We have a Spark cluster with 10 nodes. I have a process that joins a couple of dataframes and then saves the result to an S3 location. We are running in cluster mode. When I call save on the dataframe, does it save from the nodes, or does it collect all the results to the driver and write them from the driver to S3? Is there a way to verify this?
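For reference, the job looks roughly like this (a sketch only; the dataframe names, join key and bucket paths are made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("join-and-save").getOrCreate()

    // Illustrative inputs: the real job joins a couple of dataframes.
    val df1 = spark.read.parquet("s3a://my-bucket/input1/")
    val df2 = spark.read.parquet("s3a://my-bucket/input2/")

    // Join them and save the result to an S3 location.
    val result = df1.join(df2, "id")   // "id" is a hypothetical join key
    result.write.mode("overwrite").parquet("s3a://my-bucket/output/")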
Upvotes: 2
Views: 1332
Reputation: 13440
Calling save() on the dataframe triggers an evaluation of the entire query. The work is partitioned by source data (i.e. files), plus any further splitting that can be done, with individual tasks pushed to the available executors; their results are collected and finally written to the destination directory using the cross-node protocol defined in implementations of FileCommitProtocol, generally HadoopMapReduceCommitProtocol, which in turn works with Hadoop's FileOutputCommitter to choreograph the commit.
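One quick way to see the distribution in practice (a sketch, assuming a joined dataframe named result and an illustrative bucket): each task writes its own part-file directly from the executor it runs on, so the output directory ends up with one part-file per task rather than a single file produced by the driver.

    // Eight tasks each write their own part-file from their executor;
    // the driver coordinates the commit but never holds the data.
    result.repartition(8)
      .write.mode("overwrite")
      .parquet("s3a://my-bucket/output/")
    // Listing s3a://my-bucket/output/ afterwards shows ~8 part-*.parquet
    // files, one per task: evidence the write never went via the driver.

You can also watch the save's tasks run on the executors in the Spark UI while the job is live.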
Essentially: each task writes its output under a task attempt directory of the form __temporary/$job-attempt/$task-attempt, and on task commit, then job commit, that output is renamed into place in the final destination.
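That layout is managed by Hadoop's FileOutputCommitter, and its algorithm version controls when the renames happen: v1 renames task output under the job attempt directory on task commit, then into the destination on job commit; v2 renames straight into the destination on task commit. A sketch of pinning the version, assuming a live SparkSession named spark:

    // v1 is the classic two-stage rename described above; v2 renames task
    // output directly into the destination (fewer renames, but partially
    // written output becomes visible if the job fails midway).
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.algorithm.version", "1")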
On the question of "writing to S3": if you are using Apache Spark (and not Amazon EMR), be aware that this list + rename commit mechanism is (a) slow, as renames against S3 are really copies, and (b) dangerous, as the eventual consistency of S3, especially list inconsistency, means that files saved by tasks may not be listed, and hence not committed.
At the time of writing (May 2017), the sole committer known to commit safely using the s3a or s3n clients is the Netflix committer.
There's work underway to pull this into Hadoop, and hence Spark, but again, in May 2017 it is still a work in progress: demo state only. I speak as the engineer doing this.
To close, then: if you want reliable data output when writing to S3, consult whoever is hosting your code on EC2. If you are using out-of-the-box Apache Spark without any supplier-specific code, do not write directly to S3. It may work in tests, but as well as seeing intermittent failures, you may lose data and not even notice. Statistics are your enemy here: the more work you do, the bigger the datasets and the more tasks executed, and so eventually something will go wrong.
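A minimal sketch of the safer pattern that advice implies, assuming your cluster has HDFS and the paths are illustrative: commit to HDFS, where renames are atomic and listings are consistent, then copy the committed output to S3 as a separate step.

    // Commit the job output to HDFS, where the rename-based protocol is safe,
    // rather than directly to the eventually-consistent object store.
    result.write.mode("overwrite").parquet("hdfs:///tmp/join-output/")

followed, outside Spark, by something like hadoop distcp hdfs:///tmp/join-output s3a://my-bucket/output.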
Upvotes: 1