nested iterations with Apache Spark?

Question

I'm considering Apache Spark (in Java) for a project, but the project requires the data processing framework to support nested iterations. I haven't been able to find any confirmation of that. Does Spark support them? Also, are there any examples of nested iterations in use?

Thanks!

Sean Owen · Accepted Answer

Just about anything can be done, but the question is what fits the execution model well enough to bother. Spark's operations are inherently parallel, not iterative. That is, an operation is applied in parallel to many pieces of the data at once, rather than being applied to each piece sequentially (and then applied again).

However, a Spark (driver) program is just a program and can do whatever you want locally. Nested loops, or whatever else you like, are entirely fine, just as in any Scala or Java program.

I think you might use Spark operations for the bucketing process and to compute summary stats for each bucket, but otherwise run the simple remainder of the logic locally on the driver.

So the process is:

1. Broadcast a bucketing scheme.
2. Bucket according to that scheme in a distributed operation.
3. Pull small summary stats to the driver.
4. Update the bucketing scheme and send it again.
5. Repeat.
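
The steps above can be sketched in Java roughly as follows. This is only a minimal illustration under my own assumptions, not the asker's actual algorithm: the "bucketing scheme" here is an array of one-dimensional centers, each value goes to its nearest center, and the update rule replaces each center with the mean of its bucket. The class name `BucketingLoop`, the helpers `nearest` and `maxShift`, the sample data, the `local[*]` master and the iteration cap are all hypothetical choices made for the example.

```java
import java.util.Arrays;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

import scala.Tuple2;

public class BucketingLoop {

    // Assumed bucketing rule: index of the center nearest to x (ties go to the lower index).
    static int nearest(double x, double[] centers) {
        int best = 0;
        for (int i = 1; i < centers.length; i++) {
            if (Math.abs(x - centers[i]) < Math.abs(x - centers[best])) {
                best = i;
            }
        }
        return best;
    }

    // Largest absolute change between two schemes of equal length.
    static double maxShift(double[] a, double[] b) {
        double shift = 0.0;
        for (int i = 0; i < a.length; i++) {
            shift = Math.max(shift, Math.abs(a[i] - b[i]));
        }
        return shift;
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("BucketingLoop").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Made-up sample data; cached because every iteration re-reads it.
            JavaRDD<Double> data =
                sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 10.0, 11.0, 12.0)).cache();
            double[] centers = {0.0, 5.0};

            // Outer iteration runs on the driver as an ordinary Java loop.
            for (int iter = 0; iter < 20; iter++) {
                // 1. Broadcast the current bucketing scheme.
                Broadcast<double[]> scheme = sc.broadcast(centers);

                // 2. Bucket in a distributed operation, carrying (sum, count) per value.
                JavaPairRDD<Integer, double[]> bucketed = data.mapToPair(
                    x -> new Tuple2<Integer, double[]>(
                        nearest(x, scheme.value()), new double[] {x, 1.0}));

                // 3. Pull small per-bucket summary stats back to the driver.
                Map<Integer, double[]> stats = bucketed
                    .reduceByKey((a, b) -> new double[] {a[0] + b[0], a[1] + b[1]})
                    .collectAsMap();

                // 4. Update the scheme locally (inner loop, plain Java).
                double[] updated = centers.clone();
                for (Map.Entry<Integer, double[]> e : stats.entrySet()) {
                    updated[e.getKey()] = e.getValue()[0] / e.getValue()[1];
                }
                scheme.unpersist();

                // 5. Repeat until the scheme stops moving.
                boolean converged = maxShift(updated, centers) < 1e-9;
                centers = updated;
                if (converged) {
                    break;
                }
            }
            System.out.println(Arrays.toString(centers));
        }
    }
}
```

Only the bucketing and the per-bucket aggregation run on the cluster. The loops themselves are ordinary driver-side code, so nesting another loop around this one (for example, over several starting schemes) needs nothing special from Spark.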
