Ben

Reputation: 966

Nested iterations with Apache Spark?

I'm considering Apache Spark (in Java) for a project that requires the data processing framework to support nested iterations. I haven't been able to find any confirmation of this. Does Spark support nested iterations, and is there an example of their use?

Thanks!

Upvotes: 1

Views: 188

Answers (1)

Sean Owen

Reputation: 66886

Just about anything can be done, but the question is what fits the execution model well enough to bother. Spark's operations are inherently parallel, not iterative. That is, an operation is applied in parallel to many pieces of the data, rather than being applied to each piece sequentially (and then applied again).

However, a Spark (driver) program is just a program and can do whatever you want locally. Nested loops, or whatever else you like, are entirely fine, just as in any Scala (or Java) program.

I think you might use Spark operations for the bucketing process and to compute summary stats for each bucket, but otherwise run the simple remainder of the logic locally on the driver.

So the process is:

  • Broadcast a bucketing scheme
  • Bucket according to that scheme in a distributed operation
  • Pull small summary stats to the driver
  • Update the bucketing scheme on the driver and broadcast it again
  • Repeat...
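
The loop above can be sketched in plain Java. This is only a rough illustration, not Spark code: a parallel stream stands in for the distributed step so the example runs without a cluster, and the "bucketing scheme" is assumed to be a list of bucket centers with the per-bucket mean as the summary stat. The class and method names (`NestedIterationSketch`, `bucketOf`, `refine`) are made up for this example. In a real job, the scheme would go out via `JavaSparkContext.broadcast`, and the grouping would be something like `mapToPair` followed by `reduceByKey` and `collectAsMap`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: names and the center-based scheme are assumptions.
class NestedIterationSketch {

    // Index of the bucket whose center is closest to x.
    static int bucketOf(double x, double[] centers) {
        int best = 0;
        for (int i = 1; i < centers.length; i++) {
            if (Math.abs(x - centers[i]) < Math.abs(x - centers[best])) {
                best = i;
            }
        }
        return best;
    }

    // Inner iteration: bucket, summarize, update, until the scheme stops moving.
    static double[] refine(List<Double> data, double[] initial, int maxIters) {
        double[] centers = initial.clone();
        for (int iter = 0; iter < maxIters; iter++) {
            // "Broadcast" the current scheme (in Spark: sc.broadcast(centers)).
            final double[] scheme = centers;

            // Parallel step: bucket every value and compute one small
            // summary stat (the mean) per bucket. Stand-in for the
            // distributed operation; only this small map reaches the driver.
            Map<Integer, Double> means = data.parallelStream()
                .collect(Collectors.groupingBy(
                    (Double x) -> bucketOf(x, scheme),
                    Collectors.averagingDouble(Double::doubleValue)));

            // Driver-side update: empty buckets keep their old center.
            double[] updated = centers.clone();
            means.forEach((bucket, mean) -> updated[bucket] = mean);

            double maxShift = 0.0;
            for (int i = 0; i < centers.length; i++) {
                maxShift = Math.max(maxShift, Math.abs(updated[i] - centers[i]));
            }
            centers = updated;
            if (maxShift < 1e-9) {
                break; // scheme has stopped changing
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        List<Double> data = Arrays.asList(1.0, 2.0, 3.0, 10.0, 11.0, 12.0);
        double[][] candidates = { {0.0, 5.0}, {0.0, 5.0, 20.0} };
        // Outer iteration: an ordinary driver-side loop over starting schemes.
        for (double[] initial : candidates) {
            double[] result = refine(data, initial, 20);
            System.out.println(Arrays.toString(result));
        }
    }
}
```

The nesting lives entirely in the driver: the outer `for` and the loop inside `refine` are ordinary Java control flow, and only the bucket-and-summarize step would be handed to Spark.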

Upvotes: 4
