Mass17

Reputation: 1605

Combine two RDD filter operations in PySpark

I have to do an RDD-based operation. My operation is as follows:

test1 = rdd.filter(lambda y: (y[0] >= y[1])) # condition 1
test2 = rdd.filter(lambda y: (y[0] < y[1])) # condition 2
result1 = test1.collect()
result2 = test2.collect()
print('(',len(result1),',',len(result2),')')

Can I combine these two conditions into one RDD? I tried something like this:

test3 = test1.zip(test2).collect()

But it didn't work (presumably because zip requires both RDDs to have the same number of partitions and the same number of elements per partition, which two different filters won't generally produce). For example, if I apply collect() to the test1 RDD, I get a list and then find its length, and I do the same for the test2 RDD. The question is: can I do this in one shot, finding the lengths of both lists in one pass?

Upvotes: 1

Views: 256

Answers (2)

jxc

Reputation: 14008

IIUC, you can map the two conditions into a tuple of booleans, convert them to integers, and then reduce:

# create a sample RDD of 30 random integer pairs
import numpy as np
from operator import add

rdd = sc.parallelize([*map(tuple, np.random.randint(1, 100, (30, 2)))])

# each element becomes (1, 0) or (0, 1); reduce sums the tuples element-wise
rdd.map(lambda y: (int(y[0] >= y[1]), int(y[0] < y[1]))) \
   .reduce(lambda x, y: tuple(map(add, x, y)))
# (19, 11)
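
For the question's exact print format, the reduced tuple can be unpacked directly. A small usage sketch building on the code above (the variable names are illustrative):

count_ge, count_lt = rdd.map(lambda y: (int(y[0] >= y[1]), int(y[0] < y[1]))) \
                        .reduce(lambda x, y: tuple(map(add, x, y)))
print('(', count_ge, ',', count_lt, ')')  # both counts from a single pass over the data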

Upvotes: 3

Souha Gaaloul

Reputation: 328

Do you mean to get only one result instead of two?

test = rdd.filter(lambda y: (y[0] >= y[1]) and (y[0] < y[1]))

Upvotes: 1
