Ashoka Lella

Reputation: 6729

Spark reduceByKey and ignore rest

I'm trying to perform a join between two RDDs, using the first column as the key. The RDDs look like:

RDD1:
(k1,(s11,s12,s13))
(k2,(s21,s22,s23))
(k3,(s31,s32,s33))
...

RDD2:
(k1,(t11,t12,t13))
(k2,(t21,t22,t23))
(k4,(t41,t42,t43))
...

A key ki from one RDD may or may not have a match in the other. But if it does find a match, it matches exactly one row of the other RDD. In other words, the ki are primary keys in both RDDs.

I'm doing this by:

RDD3 = RDD1.union(RDD2).reduceByKey(lambda x, y: x + y).filter(lambda x: len(x[1]) == 6)

The resultant RDD would look like:

RDD3:
(k1,(s11,s12,s13,t11,t12,t13))
(k2,(s21,s22,s23,t21,t22,t23))
...

I want to avoid the filter step while computing RDD3; it looks like an avoidable computation. Is it possible to do this using built-in Spark functions? I don't want to use Spark SQL or DataFrames.
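
For reference, a minimal reproduction of the above (a sketch, assuming a local PySpark session and placeholder string values):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

RDD1 = sc.parallelize([('k1', ('s11', 's12', 's13')),
                       ('k2', ('s21', 's22', 's23')),
                       ('k3', ('s31', 's32', 's33'))])
RDD2 = sc.parallelize([('k1', ('t11', 't12', 't13')),
                       ('k2', ('t21', 't22', 't23')),
                       ('k4', ('t41', 't42', 't43'))])

# union concatenates the two RDDs; reduceByKey merges the two value
# tuples for keys present in both; filter drops keys that appeared in
# only one RDD (their values are still 3-tuples).
RDD3 = RDD1.union(RDD2).reduceByKey(lambda x, y: x + y).filter(lambda x: len(x[1]) == 6)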

Upvotes: 0

Views: 178

Answers (1)

akuiper

Reputation: 215107

You need the join method followed by mapValues to concatenate the values from the two RDDs for each key:

rdd1.join(rdd2).mapValues(lambda x: x[0] + x[1]).collect()
# [('k2', ('s21', 's22', 's23', 't21', 't22', 't23')), 
#  ('k1', ('s11', 's12', 's13', 't11', 't12', 't13'))]
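
Note that join performs an inner join, so unmatched keys (k3 and k4 in the example) are dropped automatically and no filter step is needed. Each joined record pairs the key with a 2-tuple holding one value tuple from each RDD, which the mapValues call above flattens (a sketch, assuming rdd1 and rdd2 hold the question's sample data):

rdd1.join(rdd2).collect()
# [('k2', (('s21', 's22', 's23'), ('t21', 't22', 't23'))),
#  ('k1', (('s11', 's12', 's13'), ('t11', 't12', 't13')))]
# (the ordering of collect() results may vary)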

Upvotes: 1
