How to calculate the counts of each distinct value in a pyspark dataframe?

Question

I have a column filled with a bunch of states' initials as strings. My goal is to how the count of each state in such list.

For example: (("TX":3),("NJ":2)) should be the output when there are two occurrences of "TX" and "NJ".

I'm fairly new to pyspark so I'm stumped with this problem. Any help would be much appreciated.

eddies · Accepted Answer

I think you're looking to use the DataFrame idiom of groupBy and count.

For example, given the following dataframe, one state per row:

df = sqlContext.createDataFrame([('TX',), ('NJ',), ('TX',), ('CA',), ('NJ',)], ('state',))
df.show()
+-----+
|state|
+-----+
|   TX|
|   NJ|
|   TX|
|   CA|
|   NJ|
+-----+

The following yields:

df.groupBy('state').count().show()
+-----+-----+
|state|count|
+-----+-----+
|   TX|    2|
|   NJ|    2|
|   CA|    1|
+-----+-----+

How to calculate the counts of each distinct value in a pyspark dataframe?

Answers (2)

Related Questions