Reputation: 9804
Using pyspark, I have an RDD which looks like this
[("a", 0), ("b", 1), ("a", 1), ("a", 0)]
What I'd like to do is build another RDD with the counts for the first field based on the second field. So effectively it would be:
[("a", 0, 2), ("a", 1, 1), ("b", 1, 1)]
which means that there are two instances of "a" with second field equal to 0, one instance of "a" with second field equal to 1, and one instance of "b" with second field equal to 1.
I can easily obtain the counts of the first field using reduceByKey:
rdd = sc.parallelize([("a", 0, 2), ("a", 1, 1), ("b", 1, 1)])
.map(lambda row: (row[0], 1))
.reduceByKey(add)
but this only gives me the counts of "a" and "b" regardless of the second field. How would I obtain the counts per pair instead?
Upvotes: 1
Views: 153
Reputation: 40380
If I understood your question correctly, you are probably looking for something like this:
from operator import add

rdd = (
    sc.parallelize([("a", 0), ("b", 1), ("a", 1), ("a", 0)])
    .map(lambda row: ((row[0], row[1]), 1))           # use the whole (key, value) pair as the key
    .reduceByKey(add)                                  # sum the 1s to count each pair
    .map(lambda row: (row[0][0], row[0][1], row[1]))   # flatten back to (key, value, count)
)
print(rdd.collect())
# [('a', 1, 1), ('a', 0, 2), ('b', 1, 1)]
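If the result is small enough to collect on the driver, countByValue() should also work here, since it counts each distinct element of the RDD; note it returns a plain dictionary rather than an RDD. A minimal sketch:

pairs = sc.parallelize([("a", 0), ("b", 1), ("a", 1), ("a", 0)])
counts = pairs.countByValue()                          # dict mapping each (key, value) pair to its count
result = [(k[0], k[1], v) for k, v in counts.items()]
print(result)
# e.g. [('a', 0, 2), ('b', 1, 1), ('a', 1, 1)]

The reduceByKey version above is preferable when the counts need to stay distributed for further transformations.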
Upvotes: 2