Reputation: 3722
I'm looking to groupBy and agg on the Spark DataFrame below to get the mean, max, and min of each of the col1, col2, and col3 columns:
sp = spark.createDataFrame([['a',2,4,5], ['a',4,7,7], ['b',6,0,9], ['b', 2, 4, 4], ['c', 4, 4, 9]], ['id', 'col1', 'col2','col3'])
+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
| a| 2| 4| 5|
| a| 4| 7| 7|
| b| 6| 0| 9|
| b| 2| 4| 4|
| c| 4| 4| 9|
+---+----+----+----+
I've tried sp.groupBy('id').agg({'*': 'max'}) to at least get the max of every column, but I'm running into an error. I've also tried sp.groupBy('id').agg({'col1': ['max', 'min', 'mean']}), which is more of a traditional Pandas approach, but it doesn't work either.
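As far as I can tell, the dict form of agg only maps each column name to a single aggregate function name, so something like the following runs but can only give one statistic per column:

sp.groupBy('id').agg({'col1': 'max', 'col2': 'max', 'col3': 'max'}).show()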
The desired output looks like this:

id  max(col1)  max(col2)  max(col3)  min(col1)  min(col2)  min(col3)  mean(col1) ..
a   4          7          7          2          4          5          3
b   6          4          9          2          0          4          4
c   4          4          9          4          4          9          4
Upvotes: 1
Views: 10376
Reputation: 266
Try this:
%%pyspark
from pyspark.sql import functions as F

# group by id (renamed to 'identity') and compute the sum, mean, min and max of each column
SP_agg = sp.groupBy(
    sp.id.alias('identity')
).agg(
    F.sum("col1").alias("Annual_col1"),
    F.sum("col2").alias("Annual_col2"),
    F.sum("col3").alias("Annual_col3"),
    F.mean("col1").alias("mean_col1"),
    F.mean("col2").alias("mean_col2"),
    F.mean("col3").alias("mean_col3"),
    F.min("col1").alias("min_col1"),
    F.min("col2").alias("min_col2"),
    F.min("col3").alias("min_col3"),
    F.max("col1").alias("max_col1"),
    F.max("col2").alias("max_col2"),
    F.max("col3").alias("max_col3")
)
SP_agg.show(10)
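If you only need the min, max and mean of every column (as in the desired output above), building the aggregate expressions with a list comprehension keeps this shorter. A minimal sketch, assuming the same sp DataFrame:

from pyspark.sql import functions as F

value_cols = ['col1', 'col2', 'col3']
# one aggregate expression per (function, column) pair
exprs = [f(c).alias(f'{name}({c})')
         for name, f in [('max', F.max), ('min', F.min), ('mean', F.mean)]
         for c in value_cols]

sp.groupBy('id').agg(*exprs).show()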
Upvotes: 2