Reputation: 1605
I have to do an rdd based operation. My operation as follow;
test1 = rdd.filter(lambda y: (y[0] >= y[1])) # condition 1
test2 = rdd.filter(lambda y: (y[0] < y[1])) # condition 2
result1 = test1.collect()
result2 = test2.collect()
print('(',len(result1),',',len(result2),')')
Can I combine these two conditions into one rdd? I tried something like this;
test3 = test1.zip(test2).collect()
But it didnt work. e.g. if I apply collect()
to test1 rdd, I get a list. Then I find the length of that list. Similarly I do the same thing for test2 rdd. Now the question is can I do this in one shot? finding lengths of the lists in one go.
Upvotes: 1
Views: 256
Reputation: 14008
IIUC, you can map two conditions into a tuple and convert the resulting boolean values into integers, then do reduce:
# create a sample of rdd with 30 elements
import numpy as np
from operator import add
rdd = sc.parallelize([*map(tuple, np.random.randint(1,100,(30,2)))])
rdd.map(lambda y: (int(y[0] >= y[1]), int(y[0] < y[1]))) \
.reduce(lambda x,y: tuple(map(add, x,y)))
#(19, 11)
Upvotes: 3
Reputation: 328
Do you mean to get only one result instead of 2?
test = rdd.filter(lambda y: (y[0] >= y[1]) and ((y[0] < y[1])))
Upvotes: 1