Deno George

Reputation: 362

Filtering two RDDs in PySpark

I want to do a simple filter operation here. My RDDs are:

rdd1 = [96,104,112]

rdd2 = [112, 30, 384, 96, 288, 352, 104, 368]

The result should be an RDD containing the elements of rdd2 that are not in rdd1.

So it should look like:

rdd3 = [30,384,288,352,368]

How can I achieve this?

I tried this:

 rdd3 = rdd1.map(lambda r: r != r in rdd2)

But this is not working. How can I solve it?

Thanks in advance.

Upvotes: 1

Views: 342

Answers (1)

zero323

Reputation: 330093

You can use the subtract method, which:

Return each value in self that is not contained in other.

rdd1 = sc.parallelize([96,104,112])
rdd2 = sc.parallelize([112, 30, 384, 96, 288, 352, 104, 368])

rdd2.subtract(rdd1).collect()
## [384, 352, 368, 288, 30]
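
If rdd1 is small enough to collect to the driver, a filter against a broadcast set is an alternative sketch; unlike subtract, which involves a shuffle and so gives no ordering guarantee, it preserves the order of rdd2 (assuming sc and the RDDs defined above):

small = sc.broadcast(set(rdd1.collect()))  # collect the small RDD as a set on the driver

rdd2.filter(lambda x: x not in small.value).collect()
## [30, 384, 288, 352, 368]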

Upvotes: 5
