Christie Chen

Reputation: 215

PySpark - how to count distinct values per key after groupByKey?

I would like to find how many distinct values there are for each key. For example, suppose I have

x = sc.parallelize([("a", 1), ("b", 1), ("a", 1),  ("b", 2), ("a", 2)])

And I have done the following using groupByKey:

sorted(x.groupByKey().map(lambda x : (x[0], list(x[1]))).collect())
x.groupByKey().mapValues(len).collect()

the output will be like

[('a', [1, 1, 2]), ('b', [1, 2])]
[('a', 3), ('b', 2)]

However, I want only the distinct values in each list, so the output should be like

[('a', [1, 2]), ('b', [1, 2])]
[('a', 2), ('b', 2)]

I am very new to Spark and have tried to apply the distinct() function in various places, but everything failed :-( Thanks a lot in advance!

Upvotes: 0

Views: 2313

Answers (2)

Rakesh Kumar

Reputation: 4420

You can try a number of approaches for this. I solved it using the approach below:

from operator import add

x = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("b", 2), ("a", 2)])
x = x.map(lambda n: ((n[0], n[1]), 1))  # key on the (key, value) pair itself
x.groupByKey().map(lambda n: (n[0][0], 1)).reduceByKey(add).collect()  # one 1 per distinct pair, summed per key

Output:

[('b', 2), ('a', 2)]

Hope this helps.
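
As a side note, a shorter variant of the same idea is to deduplicate the (key, value) pairs with distinct() before counting; a minimal sketch, assuming the same sc and input as above:

from operator import add

x = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("b", 2), ("a", 2)])
# distinct() removes duplicate (key, value) pairs, so each surviving
# pair stands for one distinct value of its key; emit a 1 per pair
# and sum the 1s per key.
x.distinct().map(lambda n: (n[0], 1)).reduceByKey(add).collect()
# e.g. [('b', 2), ('a', 2)] (ordering may vary)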

Upvotes: 0

Pushkr

Reputation: 3619

You can use set instead of list:

sorted(x.groupByKey().map(lambda x : (x[0], set(x[1]))).collect())
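
If you want the distinct counts themselves (the second expected output in the question), a small extension of this, assuming the same RDD x as in the question, is to take the length of each set:

sorted(x.groupByKey().mapValues(lambda vals: len(set(vals))).collect())
# [('a', 2), ('b', 2)]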

Upvotes: 1
