Reputation: 3599
As per my business logic, the week starts on Monday and ends on Sunday.
I want to get the week-end date (the Sunday) from a week number. Some years have 53 weeks, and it does not work for the 53rd week alone.
The expected value for dsupp_trans_dt is 2021-01-03, but as per the code below it is null.
scala> case class Data(id:Int,weekNumber:String)
defined class Data
scala> var stgDF = Seq(Data(100,"53/2020")).toDF()
stgDF: org.apache.spark.sql.DataFrame = [id: int, weekNumber: string]
scala> val weekNumber = "53/2020"
weekNumber: String = 53/2020
scala> val monthIdNum = "202001"
monthIdNum: String = 202001
scala> val isLeapYearFunc = (year: Int) => (((year % 4) == 0) && !(
| ((year % 100) == 0) &&
| !((year % 400) == 0))
| )
isLeapYearFunc: Int => Boolean = <function1>
scala> val isLeapYear = isLeapYearFunc(monthIdNum.substring(0,4).toInt)
isLeapYear: Boolean = true
scala> val kafkaFilePeriod = "2053"
kafkaFilePeriod: String = 2053
scala> stgDF = stgDF.withColumn("year_week_number",lit(weekNumber)).withColumn("is_leap_year",lit(isLeapYear)).withColumn("dsupp_trans_dt",
| when (col("is_leap_year") === true ,date_add(to_date(col("year_week_number"), "w/yyyy"),7)).otherwise(date_add(to_date(col("year_week_number"), "w/yyyy"),14)))
stgDF: org.apache.spark.sql.DataFrame = [id: int, weekNumber: string ... 3 more fields]
scala> stgDF.show(10,false)
+---+----------+----------------+------------+--------------+
|id |weekNumber|year_week_number|is_leap_year|dsupp_trans_dt|
+---+----------+----------------+------------+--------------+
|100|53/2020 |53/2020 |true |null |
+---+----------+----------------+------------+--------------+
The same works fine for week 52, as below:
scala> val weekNumber = "52/2020"
weekNumber: String = 52/2020
scala> stgDF = stgDF.withColumn("year_week_number",lit(weekNumber)).withColumn("is_leap_year",lit(isLeapYear)).withColumn("dsupp_trans_dt",
| when (col("is_leap_year") === true ,date_add(to_date(col("year_week_number"), "w/yyyy"),7)).otherwise(date_add(to_date(col("year_week_number"), "w/yyyy"),14)))
stgDF: org.apache.spark.sql.DataFrame = [id: int, weekNumber: string ... 3 more fields]
scala> stgDF.show
+---+----------+----------------+------------+--------------+
| id|weekNumber|year_week_number|is_leap_year|dsupp_trans_dt|
+---+----------+----------------+------------+--------------+
|100| 53/2020| 52/2020| true| 2020-12-27|
+---+----------+----------------+------------+--------------+
Upvotes: 3
Views: 1316
Reputation: 8711
'w/yyyy' returns the w-th Sunday of the year, with one caveat: you only get a valid date for the first week if January 1st falls on a Sunday; otherwise it is null. This can be confirmed using the SQL below.
spark.sql("""
with t1 ( select id , to_date(id||'/2020','w/yyyy') d1 from range(1,54) )
select *, date_format(d1,'EEEE') day from t1
""").show(54,false)
+---+----------+------+
|id |d1 |day |
+---+----------+------+
|1 |null |null |
|2 |2020-01-05|Sunday|
|3 |2020-01-12|Sunday|
|4 |2020-01-19|Sunday|
|5 |2020-01-26|Sunday|
|6 |2020-02-02|Sunday|
|7 |2020-02-09|Sunday|
|8 |2020-02-16|Sunday|
|9 |2020-02-23|Sunday|
|10 |2020-03-01|Sunday|
|11 |2020-03-08|Sunday|
|12 |2020-03-15|Sunday|
|13 |2020-03-22|Sunday|
|14 |2020-03-29|Sunday|
|15 |2020-04-05|Sunday|
|16 |2020-04-12|Sunday|
|17 |2020-04-19|Sunday|
|18 |2020-04-26|Sunday|
|19 |2020-05-03|Sunday|
|20 |2020-05-10|Sunday|
|21 |2020-05-17|Sunday|
|22 |2020-05-24|Sunday|
|23 |2020-05-31|Sunday|
|24 |2020-06-07|Sunday|
|25 |2020-06-14|Sunday|
|26 |2020-06-21|Sunday|
|27 |2020-06-28|Sunday|
|28 |2020-07-05|Sunday|
|29 |2020-07-12|Sunday|
|30 |2020-07-19|Sunday|
|31 |2020-07-26|Sunday|
|32 |2020-08-02|Sunday|
|33 |2020-08-09|Sunday|
|34 |2020-08-16|Sunday|
|35 |2020-08-23|Sunday|
|36 |2020-08-30|Sunday|
|37 |2020-09-06|Sunday|
|38 |2020-09-13|Sunday|
|39 |2020-09-20|Sunday|
|40 |2020-09-27|Sunday|
|41 |2020-10-04|Sunday|
|42 |2020-10-11|Sunday|
|43 |2020-10-18|Sunday|
|44 |2020-10-25|Sunday|
|45 |2020-11-01|Sunday|
|46 |2020-11-08|Sunday|
|47 |2020-11-15|Sunday|
|48 |2020-11-22|Sunday|
|49 |2020-11-29|Sunday|
|50 |2020-12-06|Sunday|
|51 |2020-12-13|Sunday|
|52 |2020-12-20|Sunday|
|53 |null |null |
+---+----------+------+
If you run the same query for the year 2017, which starts on a Sunday, you get a valid value for 1/2017.
spark.sql("""
with t1 ( select id , to_date(id||'/2017','w/yyyy') d1 from range(1,54) )
select *, date_format(d1,'EEEE') day from t1
""").show(54,false)
+---+----------+------+
|id |d1 |day |
+---+----------+------+
|1 |2017-01-01|Sunday|
|2 |2017-01-08|Sunday|
|3 |2017-01-15|Sunday|
.....
.....
|51 |2017-12-17|Sunday|
|52 |2017-12-24|Sunday|
|53 |null |null |
+---+----------+------+
So only roughly once every five or six years do you get a valid value for 53/xxxx. Here are the possibilities, starting from 2000:
spark.sql("""
with t1 ( select id , to_date('53'||'/'||(2000+id),'w/yyyy') week53 from range(0,30) )
select id, 2000+id year, week53, date_format(week53,'EEEE') day,
date_format((2000+id)||'-01-01', 'DD-MMM-yyyy EEEE') first_day from t1
""").show(54,false)
+---+----+----------+------+---------------------+
|id |year|week53 |day |first_day |
+---+----+----------+------+---------------------+
|0 |2000|2000-12-24|Sunday|01-Jan-2000 Saturday |
|1 |2001|null |null |01-Jan-2001 Monday |
|2 |2002|null |null |01-Jan-2002 Tuesday |
|3 |2003|null |null |01-Jan-2003 Wednesday|
|4 |2004|null |null |01-Jan-2004 Thursday |
|5 |2005|2005-12-25|Sunday|01-Jan-2005 Saturday |
|6 |2006|null |null |01-Jan-2006 Sunday |
|7 |2007|null |null |01-Jan-2007 Monday |
|8 |2008|null |null |01-Jan-2008 Tuesday |
|9 |2009|null |null |01-Jan-2009 Thursday |
|10 |2010|null |null |01-Jan-2010 Friday |
|11 |2011|2011-12-25|Sunday|01-Jan-2011 Saturday |
|12 |2012|null |null |01-Jan-2012 Sunday |
|13 |2013|null |null |01-Jan-2013 Tuesday |
|14 |2014|null |null |01-Jan-2014 Wednesday|
|15 |2015|null |null |01-Jan-2015 Thursday |
|16 |2016|2016-12-25|Sunday|01-Jan-2016 Friday |
|17 |2017|null |null |01-Jan-2017 Sunday |
|18 |2018|null |null |01-Jan-2018 Monday |
|19 |2019|null |null |01-Jan-2019 Tuesday |
|20 |2020|null |null |01-Jan-2020 Wednesday|
|21 |2021|null |null |01-Jan-2021 Friday |
|22 |2022|2022-12-25|Sunday|01-Jan-2022 Saturday |
|23 |2023|null |null |01-Jan-2023 Sunday |
|24 |2024|null |null |01-Jan-2024 Monday |
|25 |2025|null |null |01-Jan-2025 Wednesday|
|26 |2026|null |null |01-Jan-2026 Thursday |
|27 |2027|null |null |01-Jan-2027 Friday |
|28 |2028|2028-12-24|Sunday|01-Jan-2028 Saturday |
|29 |2029|null |null |01-Jan-2029 Monday |
+---+----+----------+------+---------------------+
The above output confirms that 53/xxxx only resolves to a valid date when the year starts on a Saturday (or, for a leap year such as 2016, on a Friday).
By the way, you can use SQL itself to check for a leap year: just check whether Dec-31 is the 366th day of the year.
for ( year <- Array("2020","2021","1900","2000") )
{
spark.sql(s""" select ${year} year,
case when date_format('${year}'||'-12-31','DDD')=366 then 'leap_year' else 'normal_year' end check
from range(1) """).show(false)
}
+----+---------+
|year|check |
+----+---------+
|2020|leap_year|
+----+---------+
+----+-----------+
|year|check |
+----+-----------+
|2021|normal_year|
+----+-----------+
+----+-----------+
|year|check |
+----+-----------+
|1900|normal_year|
+----+-----------+
+----+---------+
|year|check |
+----+---------+
|2000|leap_year|
+----+---------+
Another way is to use the dayofyear() function:
for( year <- Array("2020","2021","1900","2000") )
{
spark.sql(s""" select ${year} year,
case when dayofyear('${year}'||'-12-31')=366 then 'leap_year' else 'normal_year' end check
from range(1) """).show(false)
}
Upvotes: 0
Reputation: 5068
You can use a user-defined function based on the new Java Time API.
First, you need to create a function that converts a String representing a week, such as 53/2020, to the date of the Sunday of that week:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
val toWeekDate = (weekNumber: String) => {
LocalDate.parse("7/" + weekNumber, DateTimeFormatter.ofPattern("e/w/YYYY"))
}
Where, for the elements of the date pattern (see DateTimeFormatter's documentation for more details):
- e is the day of the week (1 for Monday to 7 for Sunday)
- w is the week of the year
- YYYY is the week-based year: for instance, 01/01/2021 is in week-based year 2020 because it belongs to the 53rd week of 2020.
Then you convert it to a user-defined function and register it with your Spark context:
import org.apache.spark.sql.functions.udf
val to_week_date = udf(toWeekDate)
spark.udf.register("to_week_date", to_week_date)
Finally, you can use your user-defined function when creating the new column:
import org.apache.spark.sql.functions.{col, lit}
val weekNumber = "53/2020"
stgDF
.withColumn("year_week_number",lit(weekNumber))
.withColumn("dsupp_trans_dt", to_week_date(col("year_week_number")))
Which gives you the following result:
+---+----------+----------------+--------------+
|id |weekNumber|year_week_number|dsupp_trans_dt|
+---+----------+----------------+--------------+
|100|53/2020 |53/2020 |2021-01-03 |
+---+----------+----------------+--------------+
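Since the UDF is also registered with spark.udf.register, it can be called from Spark SQL as well. A minimal sketch (the temp view name stg is only illustrative, not part of the original code):
stgDF.withColumn("year_week_number", lit(weekNumber)).createOrReplaceTempView("stg")
spark.sql("select id, to_week_date(year_week_number) as dsupp_trans_dt from stg").show(false)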
Upvotes: 1
Reputation: 293
The documentation for the weekofyear Spark function has the answer:
Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601.
It means that every year actually has 52 weeks plus n days, where n < 7.
For that reason, to_date considers 53/2020 an invalid date and returns null. For the same reason, to_date considers 01/2020 invalid, because 01/2020 is actually the 53rd week of the year 2019.
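As a quick illustration (a minimal sketch, assuming a SparkSession named spark with spark.implicits._ in scope; the sample dates are chosen only to show the boundary behavior), weekofyear assigns ISO week numbers around the year boundary like this:
import spark.implicits._
import org.apache.spark.sql.functions.{col, to_date, weekofyear}

// 2019-12-30 and 2020-01-01 both fall in ISO week 1 (of 2020),
// while 2020-12-31 and 2021-01-03 fall in ISO week 53 (of 2020).
Seq("2019-12-30", "2020-01-01", "2020-12-31", "2021-01-03")
  .toDF("d")
  .select(col("d"), weekofyear(to_date(col("d"))).as("iso_week"))
  .show(false)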
Upvotes: 0