Reputation: 9
I've been trying to get extract only the last working day of a year from a set of dates I have. My problem here is, there is s condition for the last day of the year. If 31st Dec falls on Saturday or Sunday, then I expect the last day to be 30th Dec. Below is the code I've used where I've extracted the year from date to get a list of all years present in my dataset and then have hardcoded the dates 31st and month to be 12. It worked all fine for 2020 and 2021 but the issue happened for 2022 when 31st Dec was a Saturday and hence my code failed to extract any year end date for 2022.
Dates |
---|
31-01-2020 |
31-07-2020 |
31-08-2020 |
31-12-2020 |
31-07-2020 |
31-01-2021 |
31-05-2021 |
30-06-2021 |
31-07-2020 |
31-07-2021 |
31-08-2021 |
31-12-2021 |
31-01-2022 |
31-08-2022 |
30-09-2022 |
31-10-2022 |
30-12-2022 |
31-01-2023 |
df = df.withColumn("Year", F.year(F.col("Date")))
year_list = df.select("Year").distinct().collect()
i=0
for index, row in enumerate(sorted(all_years, key=lambda x:x.Year, reverse=False)):
i = index+1
ye_date = F.concate_ws(
"-", (F.lit(row["Year"])).cast(T.StringType()). F.lit("12"), F.lit("31")).cast(T.DateType())
ye_final_date = df.filter(F.col("Date") == ye_date
The expected output should be
YE |
---|
31-12-2020 |
31-12-2021 |
30-12-2022 |
Upvotes: 0
Views: 850
Reputation: 24722
You can create the last day of the year for each date, and then subtract days as needed (using date_sub()
), depending on the day of the week (using dayofweek()
)
(
df
.withColumn("last_day", F.to_date(F.concat(F.year(F.col("Date")),F.lit("12"),F.lit("31")), "yyyyMMdd"))
.withColumn(
"last_working_day",
F.when(F.dayofweek(F.col("last_day"))==1, F.date_sub(F.col("last_day"),2))
.when(F.dayofweek(F.col("last_day"))==7, F.date_sub(F.col("last_day"),1))
.otherwise(F.col("last_day"))
)
.drop("last_day")
)
Input (here sp
is your spark session):
data= [
{"Date":"31-01-2020"},
{"Date":"17-05-2021"},
{"Date":"23-11-2022"},
{"Date":"05-05-2023"}
]
df = sp.createDataFrame(data)
df=df.withColumn("Date", F.to_date(F.col("Date"), "dd-MM-yyyy"))
| Date|
+----------+
|2020-01-31|
|2021-05-17|
|2022-11-23|
|2023-05-05|
+----------+
Output:
| Date|last_working_day|
+----------+----------------+
|2020-01-31| 2020-12-31|
|2021-05-17| 2021-12-31|
|2022-11-23| 2022-12-30|
|2023-05-05| 2023-12-29|
+----------+----------------+
Upvotes: 1