Reputation: 9804
Using pyspark, I have an RDD which looks like this
[("a", 0), ("b", 1), ("a", 1), ("a", 0)]
What I'd like to do is build another RDD with the counts for the first field based on the second field. So effectively it would be:
[("a", 0, 2), ("a", 1, 1), ("b", 1, 1)]
which means that there are two instances of "a" with second field equal to 0, one instance of "a" with second field equal to 1, and one instance of "b" with second field equal to 1.
I can easily obtain the counts of the first field using reduceByKey:
rdd = sc.parallelize([("a", 0, 2), ("a", 1, 1), ("b", 1, 1)])
.map(lambda row: (row[0], 1))
.reduceByKey(add)
but this only gives me the counts of "a" and "b" regardless of the second field. How would I obtain the counts per pair instead?
Upvotes: 1
Views: 153
Reputation: 40380
If I understood your question correctly, you are probably looking for something like this:
from operator import add

rdd = (
    sc.parallelize([("a", 0), ("b", 1), ("a", 1), ("a", 0)])
    .map(lambda row: ((row[0], row[1]), 1))           # use the whole (key, value) pair as the key
    .reduceByKey(add)                                  # sum the 1s to count each pair
    .map(lambda row: (row[0][0], row[0][1], row[1]))   # flatten back to (key, value, count)
)
print(rdd.collect())
# [('a', 1, 1), ('a', 0, 2), ('b', 1, 1)]
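If the result is small enough to collect on the driver, countByValue() should also work here, since it counts each distinct element of the RDD; note it returns a plain dictionary rather than an RDD. A minimal sketch:

pairs = sc.parallelize([("a", 0), ("b", 1), ("a", 1), ("a", 0)])
counts = pairs.countByValue()                          # dict mapping each (key, value) pair to its count
result = [(k[0], k[1], v) for k, v in counts.items()]
print(result)
# e.g. [('a', 0, 2), ('b', 1, 1), ('a', 1, 1)]

The reduceByKey version above is preferable when the counts need to stay distributed for further transformations.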
Upvotes: 2