kingledion

Reputation: 2500

How to count all values in one key of a pyspark RDD?

In a pyspark RDD, 'predicted_value' is the key for the results of a logistic regression. Obviously, 'predicted_value' holds only 0's and 1's.

I want to count the number of 0's and 1's in the output field.

I try:

Counter(rdd.groupByKey()['predicted_value'])

which gives

TypeError: 'PipelinedRDD' object is not subscriptable

What is the best way to do this?

Upvotes: 0

Views: 2617

Answers (2)

pault

Reputation: 43504

You could also use countByValue(), which does the counting on the executors rather than collecting the whole RDD to the driver:

sorted(rdd.map(lambda x: x['predicted_value']).countByValue().items())
#[(0, 580), (1, 420)]

Upvotes: 2

kingledion

Reputation: 2500

It appears that this can be done using the Counter class from the collections module:

>>> Counter([i['predicted_value'] for i in rdd.collect()])

Counter({0: 580, 1: 420})
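For reference, the counting logic itself can be sketched without a Spark cluster: rdd.collect() just returns a list of row-like records, so plain dicts can stand in for them (the values below are illustrative, not from a real model):

```python
from collections import Counter

# Stand-in for rdd.collect(): a list of row-like dicts
# with the 'predicted_value' field from the question.
rows = [{'predicted_value': 1}, {'predicted_value': 0},
        {'predicted_value': 0}, {'predicted_value': 1},
        {'predicted_value': 0}]

counts = Counter(row['predicted_value'] for row in rows)
print(counts)  # Counter({0: 3, 1: 2})
```

Note that collect() pulls the entire RDD into driver memory, so the countByValue() approach above scales better for large datasets.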

Upvotes: 0
