Matt W.

Reputation: 3722

group by agg multiple columns with pyspark

I'm looking to do a groupBy/agg on the Spark DataFrame below and get the mean, max, and min of each of the columns col1, col2, and col3.

sp = spark.createDataFrame([['a',2,4,5], ['a',4,7,7], ['b',6,0,9], ['b', 2, 4, 4], ['c', 4, 4, 9]], ['id', 'col1', 'col2','col3'])

+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
|  a|   2|   4|   5|
|  a|   4|   7|   7|
|  b|   6|   0|   9|
|  b|   2|   4|   4|
|  c|   4|   4|   9|
+---+----+----+----+

I've tried sp.groupBy('id').agg({'*':'max'}) to at least get the max of every column, but I'm running into an error.

I've also tried sp.groupBy('id').agg({'col1':['max', 'min', 'mean']}), which is more the traditional Pandas way of doing it, but it doesn't work either. The output I'm after looks like this:

id  max(col1)  max(col2)  max(col3)  min(col1) min(col2) min(col3) mean(col1) ..
a   4          7          7          2         4         5         3   
b   6          4          9          2         0         4         4  
c   4          4          9          4         4         9         4  
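For what it's worth, the dict form does run when each column is named explicitly with a single function (it just can't express several functions per column), e.g.:

sp.groupBy('id').agg({'col1': 'max', 'col2': 'max', 'col3': 'max'}).show()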

Upvotes: 1

Views: 10376

Answers (1)

Loku

Reputation: 266

Try this:

%%pyspark
from pyspark.sql import functions as F

SP_agg = sp.groupBy(
    sp.id.alias('identity')   # group on id, renamed to 'identity' in the output
    ).agg(
        F.sum("col1").alias("Annual_col1"),
        F.sum("col2").alias("Annual_col2"),
        F.sum("col3").alias("Annual_col3"),
        F.mean("col1").alias("mean_col1"),
        F.mean("col2").alias("mean_col2"),
        F.mean("col3").alias("mean_col3"),
        F.min("col1").alias("min_col1"),
        F.min("col2").alias("min_col2"),
        F.min("col3").alias("min_col3"),
        F.max("col1").alias("max_col1"),
        F.max("col2").alias("max_col2"),
        F.max("col3").alias("max_col3")
        )
SP_agg.show(10)
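If you'd rather not write out every expression by hand, the same aggregates can be generated with a list comprehension; a minimal sketch (the mean_col1, min_col1, ... alias pattern is just illustrative):

from pyspark.sql import functions as F

cols = ['col1', 'col2', 'col3']
funcs = {'mean': F.mean, 'min': F.min, 'max': F.max}

# one aggregate expression per (function, column) pair, e.g. mean(col1) -> mean_col1
exprs = [f(c).alias(f'{name}_{c}') for name, f in funcs.items() for c in cols]

sp.groupBy('id').agg(*exprs).show()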

Upvotes: 2
