VMEscoli

Reputation: 1232

Accessing a count value from a dataframe in pyspark

I hope you can help.

I have this dataframe, and I want to select, for example, the count where prediction == 4.

Code: 
the_counts=df.select('prediction').groupby('prediction').count()
the_counts.show()


+----------+-----+
|prediction|count|
+----------+-----+
|         1|    8|
|         6|   14|
|         5|    5|
|         4|    8|
|         8|    5|
|         0|    6|
+----------+-----+

So that I can assign that value to a variable, as this will be within a loop that runs many iterations.

I managed this, but only by creating a different dataframe and then changing that dataframe to a number.

dfva = the_counts.select('count').filter(the_counts.prediction ==6)
dfva.show()


+-----+
|count|
+-----+
|   14|
+-----+
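To then turn that single-row dataframe into an actual number, I do something roughly like this (just a sketch of that last step):

# pull the single row back to the driver and read its 'count' column as a plain int
value = dfva.collect()[0]['count']
print(value)   # 14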

Is there a way to access the number directly, without so many steps? Or what would be the most efficient way to do this?

This is Python 3.x and Spark 2.1.

Thank you very much

Upvotes: 0

Views: 215

Answers (1)

Suresh

Reputation: 5870

You can use the first() method to take the value directly:

>>> dfva = the_counts.filter(the_counts['prediction'] == 6).first()['count']
>>> type(dfva)
<class 'int'>
>>> print(dfva)
14
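
If this runs inside a loop over many prediction values, it may also be worth collecting the aggregated counts to the driver once and doing plain dict lookups in the loop, instead of triggering a Spark job per iteration. A minimal sketch of that idea (the dict name is just a placeholder):

>>> # collect the small aggregated result once, then look counts up from a dict
>>> counts_by_prediction = {row['prediction']: row['count'] for row in the_counts.collect()}
>>> counts_by_prediction.get(6, 0)
14
>>> counts_by_prediction.get(4, 0)
8

This is cheap here because the aggregated dataframe is tiny (one row per distinct prediction value).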

Upvotes: 2
