user1870400
user1870400

Reputation: 6364

What is the difference between forEachAsync vs forEachPartitionAsync in Apache Spark?

What is the difference between forEachAsync vs forEachPartitionAsync?

If I were to guess here is I would say the following but please correct me if I am wrong.forEachAsync just iterate through values from all partitions one by one in an Async Manner

forEachPartitionAsync: Fan out each partition and run the lambda for each partition in parallel across different workers. The lambda here will Iterate through values from that partition one by one in Async manner

but wait, rdd operations should infact execute in parallel right? so if I call rdd.forEachAsync that should execute in parallel too isn't it? I guess I am a little confused what the difference really is now between forEachAsync vs forEachPartitionAsync? besides passing in Tuple vs Iterator of Tuples to the lambda respectively.

Upvotes: 3

Views: 3514

Answers (1)

RBanerjee
RBanerjee

Reputation: 957

I believe you are already aware of the fact of Async, and asking for the difference between forEach and forEachPartition,

And the difference is, ForEachPartition will allow you to run per partition custom code which you can't do with ForEach.

For Example, You want to save your result to database. Now as you know that opening closing DB connections are costly, one connection(or pool) per executor will be best. So you code would be

rdd.forEachPartition(part => {
    db= mysql..blablabla
    part.forEach(record=> {
    db.save(record)
   })
   db.close()
})

You can't do this in ForEach, in foreach it will iterate for each record.

Remember, One partition will always run on one executor. So if you have any costly pre-work to do before start processing the data use forEachParition. If not just use forEach. Both are parallel. One gives you flexibility other gives simplicity.

Upvotes: 4

Related Questions