user2543622

Reputation: 6796

spark reduce function: understand how it works

I am taking this course.

It says that the reduce operation on an RDD is done one machine at a time. That means that if your data is split across 2 computers, the function below will work on the data on the first computer, find the result for that data, then take a single value from the second machine, run the function, and continue that way until it finishes with all the values from machine 2. Is this correct?

I thought that the function would start operating on both machines at the same time, and then, once it had the results from the 2 machines, it would run the function one last time.

rdd1=rdd.reduce(lambda x,y: x+y)
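
To illustrate what I mean, here is my (possibly wrong) mental model in plain Python, assuming the data [3, 5, 4, 7, 4] happens to be split across two machines (the split itself is just an example):

    from functools import reduce

    f = lambda x, y: x + y

    machine1 = [3, 5]                 # data on the first machine (example split)
    machine2 = [4, 7, 4]              # data on the second machine

    partial1 = reduce(f, machine1)    # 8, computed on machine 1
    partial2 = reduce(f, machine2)    # 15, computed on machine 2 at the same time
    print(f(partial1, partial2))      # 23, one final application of the function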

Update 1--------------------------------------------

Will the steps below give a faster answer compared to the reduce function?

collData = sc.parallelize([3, 5, 4, 7, 4])
seqOp = (lambda x, y: x + y)
combOp = (lambda x, y: x + y)
collData.aggregate(0, seqOp, combOp)

Update 2-----------------------------------

Should both sets of code below execute in the same amount of time? I checked, and it seems that both take the same time.

import datetime

data = range(1, 1000000000)
distData = sc.parallelize(data, 4)
print(datetime.datetime.now())
a = distData.reduce(lambda x, y: x + y)
print(a)
print(datetime.datetime.now())

seqOp = (lambda x, y: x + y)
combOp = (lambda x, y: x + y)
print(datetime.datetime.now())
b = distData.aggregate(0, seqOp, combOp)
print(b)
print(datetime.datetime.now())

Upvotes: 5

Views: 18586

Answers (1)

zero323

Reputation: 330393

reduce behavior differs a little bit between the native language (Scala) and guest languages (Python), but simplifying things a little:

  • each partition is processed sequentially element by element
  • multiple partitions can be processed at the same time either by a single worker (multiple executor threads) or different workers
  • partial results are fetched to the driver, where the final reduction is applied (this is the part that has a different implementation in PySpark and Scala)

Since it looks like you're using Python, let's take a look at the code:

  1. reduce creates a simple wrapper for the user-provided function:

    def func(iterator):
        ...  # (body omitted) applies f sequentially to the elements of one partition
    
  2. This wrapper is then used with mapPartitions:

    vals = self.mapPartitions(func).collect()
    

     It should be obvious that this code is embarrassingly parallel and doesn't care how the results are used.

  3. The collected vals are then reduced sequentially on the driver using the standard Python reduce:

    reduce(f, vals)
    

     where f is the function passed to RDD.reduce.
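
Putting the three steps together, the whole thing behaves roughly like the sketch below. This is a simplification, not the actual PySpark source; reduce_one_partition is my own name for the wrapper, and empty partitions are ignored for brevity:

    from functools import reduce as py_reduce

    f = lambda x, y: x + y                   # the function passed to RDD.reduce

    def reduce_one_partition(iterator):      # plays the role of func above
        yield py_reduce(f, iterator)         # sequential reduction within one partition

    rdd = sc.parallelize(range(1, 11), 4)    # any example RDD
    # partitions are reduced in parallel, the partial results are collected,
    # and the final reduction happens on the driver
    vals = rdd.mapPartitions(reduce_one_partition).collect()
    print(py_reduce(f, vals))                # 55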

In comparison, Scala will merge partial results asynchronously as they come in from the workers.

In the case of treeReduce, step 3 can be performed in a distributed manner as well. See Understanding treeReduce() in Spark.

To summarize: reduce, excluding the driver-side processing, uses exactly the same mechanism (mapPartitions) as basic transformations like map or filter, and provides the same level of parallelism (once again excluding the driver code). If you have a large number of partitions or f is expensive, you can parallelize / distribute the final merging using the tree* family of methods.
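
As a rough illustration (the depth value and partition count here are arbitrary, and this is a sketch rather than a benchmark), treeReduce and treeAggregate accept a depth argument that controls how many levels of partial merging happen on the executors before the final value reaches the driver:

    # same sum as before, but partial results are merged in a tree of the given depth
    # on the executors instead of all at once on the driver
    distData = sc.parallelize(range(1, 1000000), 100)
    print(distData.treeReduce(lambda x, y: x + y, depth=2))
    print(distData.treeAggregate(0, lambda x, y: x + y, lambda x, y: x + y, depth=2))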

Upvotes: 3
