mhn

Reputation: 2750

Column alias after groupBy in pyspark

I need the data frame produced by the line below to use the alias "maxDiff" for the max('diff') column after groupBy. However, the line below neither makes any change nor throws an error.

 grpdf = joined_df.groupBy(temp1.datestamp).max('diff').alias("maxDiff")

Upvotes: 56

Views: 114711

Answers (4)

Nilay Bhardwaj

Reputation: 39

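You can select the aggregated columns and alias each one (this assumes grpdf already contains both a max(diff) and a sum(DIFF) column):

from pyspark.sql.functions import col
grouped_df = grpdf.select(col("max(diff)").alias("maxdiff"), col("sum(DIFF)").alias("sumdiff"))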

Upvotes: 2

zero323

Reputation: 330453

You can use agg instead of calling the max method:

from pyspark.sql.functions import max

joined_df.groupBy(temp1.datestamp).agg(max("diff").alias("maxDiff"))

Similarly in Scala:

import org.apache.spark.sql.functions.max

joined_df.groupBy($"datestamp").agg(max("diff").alias("maxDiff"))

or

joined_df.groupBy($"datestamp").agg(max("diff").as("maxDiff"))

Upvotes: 107

vk1011

Reputation: 7179

In addition to the answers already here, the following two approaches are convenient if you know the name of the aggregated column, and neither requires an import from pyspark.sql.functions:

1

# Backticks make Spark read max(diff) as a column name rather than a function call
grouped_df = joined_df.groupBy(temp1.datestamp) \
                      .max('diff') \
                      .selectExpr('`max(diff)` AS maxDiff')

See docs for info on .selectExpr()

2

grouped_df = joined_df.groupBy(temp1.datestamp) \
                      .max('diff') \
                      .withColumnRenamed('max(diff)', 'maxDiff')

See docs for info on .withColumnRenamed()

This answer here goes into more detail: https://stackoverflow.com/a/34077809

Upvotes: 9

Nhor

Reputation: 3950

This is because you are aliasing the whole DataFrame object, not the Column. Here's an example of how to alias only the Column:

import pyspark.sql.functions as func

grpdf = joined_df \
    .groupBy(temp1.datestamp) \
    .max('diff') \
    .select(func.col("max(diff)").alias("maxDiff"))

Upvotes: 45
