reign

Reputation: 47

How to create multiple count columns in PySpark?

I have a dataframe of title and bin:

+---------------------+-------------+
|                Title|          bin|
+---------------------+-------------+
|  Forrest Gump (1994)|            3|
|  Pulp Fiction (1994)|            2|
|   Matrix, The (1999)|            3|
|     Toy Story (1995)|            1|
|    Fight Club (1999)|            3|
+---------------------+-------------+
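For reference, a minimal sketch that builds this dataframe (the SparkSession setup and the variable name df are assumptions for reproducibility):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above
df = spark.createDataFrame(
    [
        ("Forrest Gump (1994)", 3),
        ("Pulp Fiction (1994)", 2),
        ("Matrix, The (1999)", 3),
        ("Toy Story (1995)", 1),
        ("Fight Club (1999)", 3),
    ],
    ["Title", "bin"],
)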

How do I count the occurrences of each bin value, with one column per bin, in a new dataframe using PySpark? For instance:

+------------+------------+------------+
| count(bin1)| count(bin2)| count(bin3)|
+------------+------------+------------+
|           1|           1|           3|
+------------+------------+------------+

Is this possible? If so, could someone show me how?

Upvotes: 1

Views: 411

Answers (1)

blackbishop

Reputation: 32640

Group by bin and count, then pivot the bin column and, if you want, rename the columns of the resulting dataframe:

import pyspark.sql.functions as F

# Count rows per bin, then pivot the bin values into one column each
df1 = df.groupBy("bin").count().groupBy().pivot("bin").agg(F.first("count"))

# Rename the pivoted columns from "1", "2", "3" to "count_bin1", ...
df1 = df1.toDF(*[f"count_bin{c}" for c in df1.columns])

df1.show()

#+----------+----------+----------+
#|count_bin1|count_bin2|count_bin3|
#+----------+----------+----------+
#|         1|         1|         3|
#+----------+----------+----------+
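If the set of bin values is known up front, the same result can also be produced in a single aggregation using conditional counts; this is a sketch, assuming the bins are 1 through 3:

# One pass over the data: count the rows matching each bin value
df2 = df.agg(
    *[
        F.count(F.when(F.col("bin") == b, 1)).alias(f"count_bin{b}")
        for b in [1, 2, 3]
    ]
)
df2.show()

This avoids the extra shuffle from the pivot, at the cost of having to list the bin values in advance.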

Upvotes: 2
