Reputation: 2500
In a PySpark RDD, 'predicted_value' is the key for the results of a logistic regression, so it holds only 0s and 1s.
I want to count the number of 0s and 1s in that field.
I try:
Counter(rdd.groupByKey()['predicted_value'])
which gives
TypeError: 'PipelinedRDD' object is not subscriptable
What is the best way to do this?
Upvotes: 0
Views: 2617
Reputation: 43504
You could also use countByValue():
sorted(rdd.map(lambda x: x['predicted_value']).countByValue().items())
#[(0, 580), (1, 420)]
Upvotes: 2
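For a runnable illustration without a Spark cluster, the tally that `countByValue()` produces can be mimicked locally with `collections.Counter` over sample rows (the dict rows and their counts below are made up for the sketch):

```python
from collections import Counter

# Hypothetical sample rows, shaped like the RDD's records
rows = [{'predicted_value': 0}, {'predicted_value': 1},
        {'predicted_value': 0}, {'predicted_value': 0}]

# rdd.map(lambda x: x['predicted_value']).countByValue() returns a
# dict-like {value: count}; Counter gives the same tally locally.
counts = Counter(r['predicted_value'] for r in rows)
print(sorted(counts.items()))  # [(0, 3), (1, 1)]
```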
Reputation: 2500
It appears that this can be done using the Counter class from collections:
>>> Counter([i['predicted_value'] for i in rdd.collect()])
Counter({0: 580, 1: 420})
Upvotes: 0