NewCode

Reputation: 109

How to add new columns with the min and max functions in PySpark and group the data?

PySpark DataFrame: adobeDF

(screenshot of adobeDF)

Adding new columns to the dataframe:

from pyspark.sql.window import Window
from pyspark.sql import functions as f
adobeDF_new = adobeDF.withColumn('start_date', f.col('Date')).withColumn('end_date', f.col('Date'))

Result:

(screenshot of adobeDF_new with the new start_date and end_date columns)

I am trying to figure out how to save the min(Date) value in start_date and the max(Date) value in end_date, and group the final dataframe by post_evar10 and Type.

What I have tried: the code below works, but I want to see if there is a better way to do it and how to limit the data to 60 days from start_date.

from pyspark.sql.window import Window
from pyspark.sql import functions as f

# use f.min/f.max (the Python builtins raise "Column is not iterable");
# with no orderBy, the window frame spans the whole partition, so every row sees the group's earliest and latest Date
adobe_window = Window.partitionBy('post_evar10', 'Type')
adobeDF_new = adobeDF.withColumn('start_date', f.min('Date').over(adobe_window)).withColumn('end_date', f.max('Date').over(adobe_window))
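For the 60-day limit, something along these lines is roughly what I have in mind (a sketch using f.datediff against the computed start_date; adobeDF_60 is just a placeholder name), though I am not sure it is the best way:

# keep only rows whose Date falls within 60 days of the group's start_date (sketch)
adobeDF_60 = adobeDF_new.where(f.datediff(f.col('Date'), f.col('start_date')) <= 60)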

Upvotes: 0

Views: 2883

Answers (1)

Greg

Reputation: 1991

How about the following?

from pyspark.sql import functions as f

# aggregate per post_evar10; uses the DataFrame that has the start_date/end_date columns
adobeDF_new.groupBy("post_evar10").agg(
    f.min("start_date").alias("min_start"),
    f.max("end_date").alias("max_end")
)
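If the grouping by both post_evar10 and Type from the question is wanted, the same pattern can be run on the original Date column directly; a minimal sketch, assuming adobeDF still has the Date column and summaryDF is just an illustrative name:

from pyspark.sql import functions as f

# sketch: group by both keys from the question and aggregate Date directly
summaryDF = adobeDF.groupBy("post_evar10", "Type").agg(
    f.min("Date").alias("start_date"),
    f.max("Date").alias("end_date")
)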

Upvotes: 2
