Reputation: 215
I would like to find how many distinct values according to the key, for example, suppose I have
x = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("b", 2), ("a", 2)])
And I have done using groupByKey
sorted(x.groupByKey().map(lambda x : (x[0], list(x[1]))).collect())
x.groupByKey().mapValues(len).collect()
the output will by like,
[('a', [1, 1, 2]), ('b', [1, 2])]
[('a', 3), ('b', 2)]
However, I want to find distinct values in the list, the output should be like,
[('a', [1, 2]), ('b', [1, 2])]
[('a', 2), ('b', 2)]
I am very new to spark and try to apply the distinct() function somewhere, but all failed :-( Thanks a lot in advance!
Upvotes: 0
Views: 2313
Reputation: 4420
You can try number of approaches for same. I solved it using below approach:-
from operator import add
x = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("b", 2), ("a", 2)])
x = x.map(lambda n:((n[0],n[1]), 1))
x.groupByKey().map(lambda n:(n[0][0],1)).reduceByKey(add).collect()
OutPut:-
[('b', 2), ('a', 2)]
Hope This will help you.
Upvotes: 0
Reputation: 3619
you can use set instead of list -
sorted(x.groupByKey().map(lambda x : (x[0], set(x[1]))).collect())
Upvotes: 1