kingledion

Reputation: 2500

How to count all values in one key of a pyspark RDD?

In a pyspark RDD, 'predicted_value' is the key for the results of a logistic regression. Obviously, 'predicted_value' holds only 0's and 1's.

I want to count the number of 0's and 1's in the output field.

I try:

Counter(rdd.groupByKey()['predicted_value'])

which gives

TypeError: 'PipelinedRDD' object is not subscriptable

What is the best way to do this?

Upvotes: 0

Views: 2617

Answers (2)

pault

Reputation: 43504

You could also use countByValue(), which does the counting on the executors rather than collecting the whole RDD to the driver:

sorted(rdd.map(lambda x: x['predicted_value']).countByValue().items())
#[(0, 580), (1, 420)]

Upvotes: 2

kingledion

Reputation: 2500

It appears that this can be done using the Counter class from the collections module:

>>> Counter([i['predicted_value'] for i in rdd.collect()])

Counter({0: 580, 1: 420})
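For reference, the counting logic itself can be sketched without a Spark cluster: rdd.collect() just returns a list of row-like records, so plain dicts can stand in for them (the values below are illustrative, not from a real model):

```python
from collections import Counter

# Stand-in for rdd.collect(): a list of row-like dicts
# with the 'predicted_value' field from the question.
rows = [{'predicted_value': 1}, {'predicted_value': 0},
        {'predicted_value': 0}, {'predicted_value': 1},
        {'predicted_value': 0}]

counts = Counter(row['predicted_value'] for row in rows)
print(counts)  # Counter({0: 3, 1: 2})
```

Note that collect() pulls the entire RDD into driver memory, so the countByValue() approach above scales better for large datasets.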

Upvotes: 0
