Reputation: 6796
I am taking this course.
It says that the reduce operation on an RDD is done one machine at a time. That means that if your data is split across 2 computers, the function below will work on the data on the first computer, find the result for that data, then take a single value from the second machine, run the function on it, and continue that way until it finishes with all the values from machine 2. Is this correct?
I thought that the function would start operating on both machines at the same time, and then, once it has the results from the 2 machines, it would run the function one last time:
rdd1 = rdd.reduce(lambda x, y: x + y)
Will the steps below give a faster answer compared to the reduce function?
Rdd = [3, 5, 4, 7, 4]
collData = sc.parallelize(Rdd)
seqOp = (lambda x, y: x + y)
combOp = (lambda x, y: x + y)
collData.aggregate(0, seqOp, combOp)
Should both sets of code below execute in the same amount of time? I checked, and it seems that both take the same time.
import datetime

data = range(1, 1000000000)
distData = sc.parallelize(data, 4)

print(datetime.datetime.now())
a = distData.reduce(lambda x, y: x + y)
print(a)
print(datetime.datetime.now())

seqOp = (lambda x, y: x + y)
combOp = (lambda x, y: x + y)

print(datetime.datetime.now())
b = distData.aggregate(0, seqOp, combOp)
print(b)
print(datetime.datetime.now())
Upvotes: 5
Views: 18586
Reputation: 330393
reduce behavior differs a little bit between native (Scala) and guest languages (Python), but simplifying things a little:
Since it looks like you're using Python, let's take a look at the code:
1. reduce creates a simple wrapper for a user provided function:

   def func(iterator):
       ...

2. This wrapper is used with mapPartitions:

   vals = self.mapPartitions(func).collect()

   It should be obvious this code is embarrassingly parallel and doesn't care how the results are utilized.

3. Collected vals are reduced sequentially on the driver using standard Python reduce:

   reduce(f, vals)

   where f is the function passed to RDD.reduce (see the local sketch after this list).
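To make those three steps concrete, here is a minimal, Spark-free sketch (the names partitions and vals are made up for illustration, and the per-partition wrapper is simplified, not the exact PySpark source): each partition is folded independently, which is the part Spark can run in parallel on different workers, and only the short list of partial results is merged sequentially at the end.

from functools import reduce

f = lambda x, y: x + y            # the user function passed to RDD.reduce

# Pretend the RDD [3, 5, 4, 7, 4] is split across two machines (partitions).
partitions = [[3, 5], [4, 7, 4]]

def func(iterator):
    # Steps 1-2: fold one partition down to a single partial value.
    # In Spark this runs via mapPartitions, once per partition, so all
    # partitions can be processed at the same time on different workers.
    it = iter(iterator)
    first = next(it)
    yield reduce(f, it, first)

# One partial result per partition -- roughly what collect() returns as vals.
vals = [v for part in partitions for v in func(part)]   # [8, 15]

# Step 3: the driver merges the few partial values sequentially.
print(reduce(f, vals))                                   # 23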
In comparison, Scala will merge partial results asynchronously as they come from the workers.
In the case of treeReduce, step 3 can be performed in a distributed manner as well. See Understanding treeReduce() in Spark.
To summarize: reduce, excluding driver-side processing, uses exactly the same mechanism (mapPartitions) as basic transformations like map or filter, and provides the same level of parallelism (once again excluding driver code). If you have a large number of partitions or f is expensive, you can parallelize / distribute the final merging using the tree* family of methods.
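If you want to try the tree-based merging mentioned above, here is a hedged sketch using the question's distData (treeReduce and its depth parameter are part of the standard PySpark RDD API; whether it is actually faster depends on the number of partitions and the cost of the merge function):

# Merge partial results in a tree pattern instead of all at once on the driver.
# depth controls how many levels of intermediate aggregation are used.
c = distData.treeReduce(lambda x, y: x + y, depth=2)
print(c)

For a cheap merge like addition and only a handful of partitions, this typically performs about the same as plain reduce; it mainly helps when there are many partitions or the combining function is expensive.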
Upvotes: 3