Reputation: 109
PySpark Dataframe: adobeDF
Adding new columns to the dataframe:
from pyspark.sql.window import Window
from pyspark.sql import functions as f
adobeDF_new = adobeDF.withColumn('start_date', f.col('Date')).withColumn('end_date', f.col('Date'))
Result:
I am trying to figure out how to save the min(Date) value in start_date and the max(Date) value in end_date, grouping the final dataframe by post_evar10 and Type.
What I have tried: The code below works, but I want to know whether there is a better way to do it and how to limit the data to 60 days from start_date.
from pyspark.sql.window import Window
from pyspark.sql import functions as f

# window over each (post_evar10, Type) group
adobe_window = Window.partitionBy('post_evar10', 'Type').orderBy('Date')
adobeDF_new = adobeDF.withColumn('start_date', f.min(f.col('Date')).over(adobe_window)) \
    .withColumn('end_date', f.max(f.col('Date')).over(adobe_window))
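A window without an orderBy makes the frame cover the whole partition, so f.min/f.max return the group-wide values (with an orderBy, the default frame only runs up to the current row, so f.max becomes a running max). The 60-day restriction can then be a datediff filter. A minimal sketch, assuming Date is a date or timestamp column and reusing the names from the question:

from pyspark.sql.window import Window
from pyspark.sql import functions as f

# whole-partition window per (post_evar10, Type)
adobe_window = Window.partitionBy('post_evar10', 'Type')

adobeDF_new = (
    adobeDF
    .withColumn('start_date', f.min('Date').over(adobe_window))
    .withColumn('end_date', f.max('Date').over(adobe_window))
    # keep only rows within 60 days of each group's start_date
    .filter(f.datediff(f.col('Date'), f.col('start_date')) <= 60)
)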
Upvotes: 0
Views: 2883
Reputation: 1991
How about the following?
adobeDF.groupBy("post_evar10").agg(
f.min("start_date").alias("min_start"),
f.max("end_date").alias("max_end")
)
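If the end result should be one row per post_evar10 and Type with the date range, the same aggregation can also run directly on Date, without first creating the start_date/end_date columns. A sketch under that assumption (the variable name date_range_df is only illustrative):

from pyspark.sql import functions as f

# one row per (post_evar10, Type) with the earliest and latest Date
date_range_df = adobeDF.groupBy('post_evar10', 'Type').agg(
    f.min('Date').alias('start_date'),
    f.max('Date').alias('end_date')
)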
Upvotes: 2