Reputation: 133
I'm trying to add several new columns to my dataframe (preferably in a for loop), with each new column being the count of certain instances of column B, after grouping by column A.
What doesn't work:
from pyspark.sql import functions as f
#the first one will be fine
df_grouped=df.select('A','B').filter(df.B=='a').groupBy('A').count()
df_grouped.show()
+---+-----+
|  A|count|
+---+-----+
|859|    4|
|947|    2|
|282|    6|
|699|   24|
|153|   12|
+---+-----+
# create the second column:
df_g2=df.select('A','B').filter(df.B=='b').groupBy('A').count()
df_g2.show()
+---+-----+
|  A|count|
+---+-----+
|174|   18|
|153|   20|
|630|    6|
|147|   16|
+---+-----+
#I get an error on adding the new column:
df_grouped=df_grouped.withColumn('2nd_count',f.col(df_g2.select('count')))
The error:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
I also tried it without using f.col, and with just df_g2.count, but I get an error saying "col should be column".
Something that DOES work:
df_g1=df.select('A','B').filter(df.B=='a').groupBy('A').count()
df_g2=df.select('A','B').filter(df.B=='b').groupBy('A').count()
df_grouped=df_g1.join(df_g2,['A'])
However, I'm going to add up to around 1000 new columns, and that many joins seems costly. I wonder whether joins are inevitable, given that every time I group by column A its order changes in the grouped object (e.g. compare the order of column A in df_grouped with its order in df_g2 above), or whether there is a better way to do this.
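For reference, spelled out as a loop, the join approach would look something like this (a sketch only; the b_values list and the count_&lt;value&gt; column names are placeholders I made up):

from functools import reduce

b_values = ['a', 'b', 'c']  # placeholder: the B values to count, up to ~1000

counts = [
    (df.filter(df.B == v)
       .groupBy('A')
       .count()
       .withColumnRenamed('count', 'count_' + v))  # unique column name per value
    for v in b_values
]

# Chain the joins; each one shuffles, which is exactly the cost I'm worried about.
df_grouped = reduce(lambda left, right: left.join(right, ['A'], 'outer'), counts)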
Upvotes: 0
Views: 3212
Reputation: 3817
What you probably need is groupby and pivot.
Try this:
from pyspark.sql import functions as F

df.groupby('A').pivot('B').agg(F.count('B')).show()
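If you already know which values of B you want as columns, you can pass them to pivot explicitly, which skips the extra pass Spark otherwise makes to collect the distinct values, and fill the missing counts with 0 instead of null. A sketch, where the b_values list is an assumption:

from pyspark.sql import functions as F

b_values = ['a', 'b']  # assumption: the B values you want as columns

(df.groupby('A')
   .pivot('B', b_values)   # explicit values: no extra distinct-collection job
   .agg(F.count('B'))
   .na.fill(0)             # groups with no rows for a value get 0 rather than null
   .show())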
Upvotes: 1