Reputation: 2750
I need the data frame produced by the line below to have the alias "maxDiff" for the max('diff') column after groupBy. However, the line below neither makes any change nor throws an error.
grpdf = joined_df.groupBy(temp1.datestamp).max('diff').alias("maxDiff")
Upvotes: 56
Views: 114711
Reputation: 39
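You can use select with Column.alias (col comes from pyspark.sql.functions). Call show() separately, since it returns None:
grouped_df = grpdf.select(col("max(diff)").alias("maxdiff"), col("sum(DIFF)").alias("sumdiff"))
grouped_df.show()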
Upvotes: 2
Reputation: 330453
You can use agg instead of calling the max method:
from pyspark.sql.functions import max
joined_df.groupBy(temp1.datestamp).agg(max("diff").alias("maxDiff"))
Similarly in Scala
import org.apache.spark.sql.functions.max
joined_df.groupBy($"datestamp").agg(max("diff").alias("maxDiff"))
or
joined_df.groupBy($"datestamp").agg(max("diff").as("maxDiff"))
Upvotes: 107
Reputation: 7179
In addition to the answers already here, the following are also convenient if you know the name of the aggregated column, and they don't require importing from pyspark.sql.functions:
1
grouped_df = joined_df.groupBy(temp1.datestamp) \
.max('diff') \
.selectExpr('`max(diff)` AS maxDiff')  # backticks make max(diff) a column name, not a function call
See docs for info on .selectExpr()
2
grouped_df = joined_df.groupBy(temp1.datestamp) \
.max('diff') \
.withColumnRenamed('max(diff)', 'maxDiff')
See docs for info on .withColumnRenamed()
This answer here goes into more detail: https://stackoverflow.com/a/34077809
Upvotes: 9
Reputation: 3950
This is because you are aliasing the whole DataFrame object, not the Column. Here's an example of how to alias only the Column:
import pyspark.sql.functions as func
grpdf = joined_df \
.groupBy(temp1.datestamp) \
.max('diff') \
.select(func.col("max(diff)").alias("maxDiff"))
Upvotes: 45