Saba Far

Reputation: 133

Creating multiple columns for a grouped pyspark dataframe

I'm trying to add several new columns to my dataframe (preferably in a for loop), with each new column being the count of certain instances of col B, after grouping by column A.

What doesn't work:

from pyspark.sql import functions as f
# the first column works fine
df_grouped=df.select('A','B').filter(df.B=='a').groupBy('A').count()
df_grouped.show()
+---+-----+
| A |count|
+---+-----+
|859|    4|
|947|    2|
|282|    6|
|699|   24|
|153|   12|

# create the second column:
df_g2=df.select('A','B').filter(df.B=='b').groupBy('A').count() 
df_g2.show()
+---+-----+
| A |count|
+---+-----+
|174|   18|
|153|   20|
|630|    6|
|147|   16|

# I get an error when adding the new column:
df_grouped=df_grouped.withColumn('2nd_count',f.col(df_g2.select('count')))

The error:

AttributeError: 'DataFrame' object has no attribute '_get_object_id'

I also tried it without using f.col, and with just df_g2.count, but I get an error saying "col should be column".

Something that DOES work:

df_g1=df.select('A','B').filter(df.B=='a').groupBy('A').count()
df_g2=df.select('A','B').filter(df.B=='b').groupBy('A').count()
df_grouped=df_g1.join(df_g2,['A'])  

However, I'm going to add up to around 1000 new columns, and having so many joins seems costly. I wonder if doing joins is inevitable, given that every time I group by col A, its order changes in the grouped object (e.g. compare the order of column A in df_grouped with its order in df_g2 above), or whether there is a better way to do this.
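For reference, generalising that join approach to many values of B might look like the sketch below; the list of values, the renamed count columns and the use of functools.reduce are assumptions of mine rather than part of the original post, and it still performs one join per value:

from functools import reduce
from pyspark.sql import functions as f

values = ['a', 'b', 'c']  # hypothetical list of B values to count
counts = [
    df.filter(f.col('B') == v)
      .groupBy('A')
      .count()
      .withColumnRenamed('count', 'count_' + v)
    for v in values
]
# chain the joins; an outer join keeps A values that only appear for some B
df_grouped = reduce(lambda left, right: left.join(right, ['A'], 'outer'), counts)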

Upvotes: 0

Views: 3212

Answers (1)

Ala Tarighati

Reputation: 3817

What you probably need is groupby and pivot. Try this:

df.groupby('A').pivot('B').agg(F.count('B')).show()
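For context, a minimal self-contained sketch of what this produces on toy data (the sample values and the fillna(0) step are assumptions, not from the question; F is the usual alias for pyspark.sql.functions):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# toy stand-in for the original df (hypothetical values)
df = spark.createDataFrame(
    [(859, 'a'), (859, 'a'), (859, 'b'), (153, 'a'), (153, 'b'), (153, 'b')],
    ['A', 'B'],
)

# one row per A, one column per distinct value of B, counts in the cells;
# fillna(0) replaces the nulls produced for (A, B) pairs that never occur
df.groupby('A').pivot('B').agg(F.count('B')).fillna(0).show()
# +---+---+---+
# |  A|  a|  b|
# +---+---+---+
# |153|  1|  2|
# |859|  2|  1|
# +---+---+---+
# (row order may differ)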

Upvotes: 1
