Reputation: 1106
I have the followings rdds that I want to join them using leftOuterJoin. I was wondering if reduceByKey would be more efficient/faster than leftOuterJoin.
rd0= sc.parallelize([ ('s1', 'o1' ),("s1", 'o2' ),('s2','o2'),("s3",'o3')])
rd1= sc.parallelize([ ('s1', 'oo1' ),("s10", 'oo10' ),('s2','oo2')])
reduceByKeyMethod
rd00 = rd0.map(lambda x:(x[0],([x[1]],[])))
rd11 = rd1.map(lambda x:(x[0],([],[x[1]])))
rd00.union(rd11).reduceByKey(lambda x,y:(x[0]+y[0],x[1]+y[1])).collect()
Out[22]:
[('s1', (['o1'], [])),
('s1', (['o2'], [])),
('s2', (['o2'], [])),
('s3', (['o3'], [])),
('s1', ([], ['oo1'])),
('s10', ([], ['oo10'])),
('s2', ([], ['oo2']))]
vs using leftOuterJoin directly rd0.leftOuterJoin(rd1)
Will using reduceByKey be faster for large rd0 and rd1 datasets?
Upvotes: 0
Views: 131
Reputation: 655
If we check execution plan for both approaches => There Should be No Difference
As shown using toDebugString
print(rd00.union(rd11).reduceByKey(lambda x,y:(x[0]+y[0],x[1]+y[1])).toDebugString())
Prints
(4) PythonRDD[15] at RDD at PythonRDD.scala:49 []
| MapPartitionsRDD[14] at mapPartitions at PythonRDD.scala:129 []
| ShuffledRDD[13] at partitionBy at NativeMethodAccessorImpl.java:0 []
+-(4) PairwiseRDD[12] at reduceByKey at <stdin>:1 []
| PythonRDD[11] at reduceByKey at <stdin>:1 []
| UnionRDD[10] at union at NativeMethodAccessorImpl.java:0 []
| PythonRDD[2] at RDD at PythonRDD.scala:49 []
| ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:184 []
| PythonRDD[3] at RDD at PythonRDD.scala:49 []
| ParallelCollectionRDD[1] at parallelize at PythonRDD.scala:184 []
And leftOuterJoin
print(rd00.leftOuterJoin(rd11).toDebugString())
Prints
(4) PythonRDD[23] at RDD at PythonRDD.scala:49 []
| MapPartitionsRDD[22] at mapPartitions at PythonRDD.scala:129 []
| ShuffledRDD[21] at partitionBy at NativeMethodAccessorImpl.java:0 []
+-(4) PairwiseRDD[20] at leftOuterJoin at <stdin>:1 []
| PythonRDD[19] at leftOuterJoin at <stdin>:1 []
| UnionRDD[18] at union at NativeMethodAccessorImpl.java:0 []
| PythonRDD[16] at RDD at PythonRDD.scala:49 []
| ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:184 []
| PythonRDD[17] at RDD at PythonRDD.scala:49 []
| ParallelCollectionRDD[1] at parallelize at PythonRDD.scala:184 []
Upvotes: 2