Reputation: 362
I want to do a simple filter. My RDDs are:
rdd1 = [96,104,112]
rdd2 = [112, 30, 384, 96, 288, 352, 104, 368]
The result should be an RDD containing the elements of rdd2 that are not in rdd1, so it should look like this:
rdd3 = [30,384,288,352,368]
How can I achieve this?
I tried this:
rdd3 = rdd1.map(lambda r: r != r in rdd2)
But this does not work. How can I solve this?
Thanks in advance.
Upvotes: 1
Views: 342
Reputation: 330093
You can use the subtract method, which, according to its documentation, will:
"Return each value in self that is not contained in other."
rdd1 = sc.parallelize([96,104,112])
rdd2 = sc.parallelize([112, 30, 384, 96, 288, 352, 104, 368])
rdd2.subtract(rdd1).collect()
## [384, 352, 368, 288, 30]  (element order is not guaranteed)
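As for why the attempt in the question does not work, here is a minimal sketch in plain Python, with ordinary lists standing in for the RDD contents (the names rdd1_values, rdd2_values, attempt, excluded and kept are illustrative only). The expression `r != r in rdd2` is a chained comparison, so it can never be true for integers, and map would return one boolean per element rather than dropping elements anyway. Spark also does not let you reference one RDD inside a transformation of another.

```python
# Plain Python lists stand in for the RDD contents (assumption: no Spark needed here).
rdd1_values = [96, 104, 112]
rdd2_values = [112, 30, 384, 96, 288, 352, 104, 368]

# `r != r in xs` is parsed as `(r != r) and (r in xs)`.
# `r != r` is False for any integer, so every result is False.
attempt = [r != r in rdd2_values for r in rdd1_values]

# What was intended: keep the values of the second list that are absent from the first.
excluded = set(rdd1_values)
kept = [r for r in rdd2_values if r not in excluded]

print(attempt)  # [False, False, False]
print(kept)     # [30, 384, 288, 352, 368]

# In PySpark the same predicate could be passed to filter, assuming rdd1 is
# small enough to collect to the driver:
#   excluded = set(rdd1.collect())
#   rdd3 = rdd2.filter(lambda r: r not in excluded)
```

If rdd1 is large, subtract is probably the better choice, since it avoids collecting rdd1 to the driver.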
Upvotes: 5