Surender Raja

Reputation: 3599

In a Spark Scala DataFrame, how do I get the week end date based on the week number?

As per my business logic, the week start day is Monday and the week end day is Sunday.

I want to get the week end date, which is a Sunday, based on the week number. Some years have 53 weeks, and it is not working for the 53rd week alone.

The expected value for dsupp_trans_dt is 2021-01-03,

but as per the code below it is null:

 scala>  case class Data(id:Int,weekNumber:String)
 defined class Data

 scala>  var stgDF = Seq(Data(100,"53/2020")).toDF()
 stgDF: org.apache.spark.sql.DataFrame = [id: int, weekNumber: string]

 scala>  val weekNumber = "53/2020"
 weekNumber: String = 53/2020

 scala>  val monthIdNum = "202001"
 monthIdNum: String = 202001

 scala>         val isLeapYearFunc = (year: Int) => (((year % 4) == 0) && !(
     |           ((year % 100) == 0) &&
     |             !((year % 400) == 0))
     |           )
  isLeapYearFunc: Int => Boolean = <function1>

 scala>  val isLeapYear = isLeapYearFunc(monthIdNum.substring(0,4).toInt)
 isLeapYear: Boolean = true

 scala>  val kafkaFilePeriod = "2053"
 kafkaFilePeriod: String = 2053

 scala> stgDF = stgDF.withColumn("year_week_number",lit(weekNumber)).withColumn("is_leap_year",lit(isLeapYear)).withColumn("dsupp_trans_dt",
 | when (col("is_leap_year") === true ,date_add(to_date(col("year_week_number"), "w/yyyy"),7)).otherwise(date_add(to_date(col("year_week_number"), "w/yyyy"),14)))
 stgDF: org.apache.spark.sql.DataFrame = [id: int, weekNumber: string ... 3 more fields]

 scala> stgDF.show(10,false)

 +---+----------+----------------+------------+--------------+
 |id |weekNumber|year_week_number|is_leap_year|dsupp_trans_dt|
 +---+----------+----------------+------------+--------------+
 |100|53/2020   |53/2020         |true        |null          |
 +---+----------+----------------+------------+--------------+

The same works fine for the case below:

 scala> val weekNumber = "52/2020"
 weekNumber: String = 52/2020

  scala> stgDF = stgDF.withColumn("year_week_number",lit(weekNumber)).withColumn("is_leap_year",lit(isLeapYear)).withColumn("dsupp_trans_dt",
 | when (col("is_leap_year") === true ,date_add(to_date(col("year_week_number"), "w/yyyy"),7)).otherwise(date_add(to_date(col("year_week_number"), "w/yyyy"),14)))
  stgDF: org.apache.spark.sql.DataFrame = [id: int, weekNumber: string ... 3 more fields]

  scala> stgDF.show
  +---+----------+----------------+------------+--------------+
  | id|weekNumber|year_week_number|is_leap_year|dsupp_trans_dt|
  +---+----------+----------------+------------+--------------+
  |100|   53/2020|         52/2020|        true|    2020-12-27|
  +---+----------+----------------+------------+--------------+

Upvotes: 3

Views: 1316

Answers (3)

stack0114106

Reputation: 8711

'w/yyyy' will return the w-th Sunday of the year, with the following caveat: the first week ('1/yyyy') only yields a valid date when January 1st is a Sunday; otherwise it is null. This can be confirmed using the SQL below:

spark.sql("""
with t1 ( select id , to_date(id||'/2020','w/yyyy') d1 from range(1,54) )
select *, date_format(d1,'EEEE') day from t1
""").show(54,false)

+---+----------+------+
|id |d1        |day   |
+---+----------+------+
|1  |null      |null  |
|2  |2020-01-05|Sunday|
|3  |2020-01-12|Sunday|
|4  |2020-01-19|Sunday|
|5  |2020-01-26|Sunday|
|6  |2020-02-02|Sunday|
|7  |2020-02-09|Sunday|
|8  |2020-02-16|Sunday|
|9  |2020-02-23|Sunday|
|10 |2020-03-01|Sunday|
|11 |2020-03-08|Sunday|
|12 |2020-03-15|Sunday|
|13 |2020-03-22|Sunday|
|14 |2020-03-29|Sunday|
|15 |2020-04-05|Sunday|
|16 |2020-04-12|Sunday|
|17 |2020-04-19|Sunday|
|18 |2020-04-26|Sunday|
|19 |2020-05-03|Sunday|
|20 |2020-05-10|Sunday|
|21 |2020-05-17|Sunday|
|22 |2020-05-24|Sunday|
|23 |2020-05-31|Sunday|
|24 |2020-06-07|Sunday|
|25 |2020-06-14|Sunday|
|26 |2020-06-21|Sunday|
|27 |2020-06-28|Sunday|
|28 |2020-07-05|Sunday|
|29 |2020-07-12|Sunday|
|30 |2020-07-19|Sunday|
|31 |2020-07-26|Sunday|
|32 |2020-08-02|Sunday|
|33 |2020-08-09|Sunday|
|34 |2020-08-16|Sunday|
|35 |2020-08-23|Sunday|
|36 |2020-08-30|Sunday|
|37 |2020-09-06|Sunday|
|38 |2020-09-13|Sunday|
|39 |2020-09-20|Sunday|
|40 |2020-09-27|Sunday|
|41 |2020-10-04|Sunday|
|42 |2020-10-11|Sunday|
|43 |2020-10-18|Sunday|
|44 |2020-10-25|Sunday|
|45 |2020-11-01|Sunday|
|46 |2020-11-08|Sunday|
|47 |2020-11-15|Sunday|
|48 |2020-11-22|Sunday|
|49 |2020-11-29|Sunday|
|50 |2020-12-06|Sunday|
|51 |2020-12-13|Sunday|
|52 |2020-12-20|Sunday|
|53 |null      |null  |
+---+----------+------+

If you run the same query for 2017, a year which starts on a Sunday, you get a valid value for 1/2017:

spark.sql("""
with t1 ( select id , to_date(id||'/2017','w/yyyy') d1 from range(1,54) )
select *, date_format(d1,'EEEE') day from t1
""").show(54,false)
+---+----------+------+
|id |d1        |day   |
+---+----------+------+
|1  |2017-01-01|Sunday|
|2  |2017-01-08|Sunday|
|3  |2017-01-15|Sunday|
.....
.....
|51 |2017-12-17|Sunday|
|52 |2017-12-24|Sunday|
|53 |null      |null  |
+---+----------+------+

So roughly once in seven years you will get a valid value for 53/xxxx. Here are the possibilities, starting from 2000:

spark.sql("""
with t1 ( select id , to_date('53'||'/'||(2000+id),'w/yyyy') week53 from range(0,30) )
select id, 2000+id year, week53, date_format(week53,'EEEE') day, 
          date_format((2000+id)||'-01-01', 'DD-MMM-yyyy EEEE') first_day from t1
""").show(54,false)

+---+----+----------+------+---------------------+
|id |year|week53    |day   |first_day            |
+---+----+----------+------+---------------------+
|0  |2000|2000-12-24|Sunday|01-Jan-2000 Saturday |
|1  |2001|null      |null  |01-Jan-2001 Monday   |
|2  |2002|null      |null  |01-Jan-2002 Tuesday  |
|3  |2003|null      |null  |01-Jan-2003 Wednesday|
|4  |2004|null      |null  |01-Jan-2004 Thursday |
|5  |2005|2005-12-25|Sunday|01-Jan-2005 Saturday |
|6  |2006|null      |null  |01-Jan-2006 Sunday   |
|7  |2007|null      |null  |01-Jan-2007 Monday   |
|8  |2008|null      |null  |01-Jan-2008 Tuesday  |
|9  |2009|null      |null  |01-Jan-2009 Thursday |
|10 |2010|null      |null  |01-Jan-2010 Friday   |
|11 |2011|2011-12-25|Sunday|01-Jan-2011 Saturday |
|12 |2012|null      |null  |01-Jan-2012 Sunday   |
|13 |2013|null      |null  |01-Jan-2013 Tuesday  |
|14 |2014|null      |null  |01-Jan-2014 Wednesday|
|15 |2015|null      |null  |01-Jan-2015 Thursday |
|16 |2016|2016-12-25|Sunday|01-Jan-2016 Friday   |
|17 |2017|null      |null  |01-Jan-2017 Sunday   |
|18 |2018|null      |null  |01-Jan-2018 Monday   |
|19 |2019|null      |null  |01-Jan-2019 Tuesday  |
|20 |2020|null      |null  |01-Jan-2020 Wednesday|
|21 |2021|null      |null  |01-Jan-2021 Friday   |
|22 |2022|2022-12-25|Sunday|01-Jan-2022 Saturday |
|23 |2023|null      |null  |01-Jan-2023 Sunday   |
|24 |2024|null      |null  |01-Jan-2024 Monday   |
|25 |2025|null      |null  |01-Jan-2025 Wednesday|
|26 |2026|null      |null  |01-Jan-2026 Thursday |
|27 |2027|null      |null  |01-Jan-2027 Friday   |
|28 |2028|2028-12-24|Sunday|01-Jan-2028 Saturday |
|29 |2029|null      |null  |01-Jan-2029 Monday   |
+---+----+----------+------+---------------------+

The above query confirms that the year has to start on a Saturday or a Friday to get 53 Sundays.

BTW, you can use SQL itself to check for a leap year instead of the Scala function. Just check whether Dec 31 is the 366th day of the year:

for ( year <- Array("2020","2021","1900","2000") )
{
spark.sql(s""" select ${year} year, 
 case when date_format('${year}'||'-12-31','DDD')=366 then 'leap_year' else 'normal_year' end check
 from range(1) """).show(false)
}   

+----+---------+
|year|check    |
+----+---------+
|2020|leap_year|
+----+---------+

+----+-----------+
|year|check      |
+----+-----------+
|2021|normal_year|
+----+-----------+

+----+-----------+
|year|check      |
+----+-----------+
|1900|normal_year|
+----+-----------+

+----+---------+
|year|check    |
+----+---------+
|2000|leap_year|
+----+---------+

Another way is to use the dayofyear() function:

for( year <- Array("2020","2021","1900","2000") )
{
spark.sql(s""" select ${year} year, 
 case when dayofyear('${year}'||'-12-31')=366 then 'leap_year' else 'normal_year' end check
 from range(1) """).show(false)
}  
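
The same check can also be written with the DataFrame API instead of SQL; here is a minimal sketch (the hard-coded year and the leapCheck name are just illustrative):

import org.apache.spark.sql.functions.{dayofyear, lit, when}

// leap_year when 31 December is the 366th day of the year, normal_year otherwise
val leapCheck = spark.range(1).select(
  lit(2020).as("year"),
  when(dayofyear(lit("2020-12-31")) === 366, "leap_year").otherwise("normal_year").as("check"))
leapCheck.show(false)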

Upvotes: 0

Vincent Doba

Reputation: 5068

You can use a user-defined function based on the new Java time API.

First, you need to create a function that converts a String representing a week, such as 53/2020, to the date of the Sunday of that week:

import java.time.LocalDate
import java.time.format.DateTimeFormatter

val toWeekDate = (weekNumber: String) => {
  LocalDate.parse("7/" + weekNumber, DateTimeFormatter.ofPattern("e/w/YYYY"))
}

Where, for each element of the date pattern (see the DateTimeFormatter documentation for more details):

  • e is for the day of week (1 for Monday to 7 for Sunday)
  • w is the week of year
  • YYYY is the week year: for instance, 01/01/2021 is in week year 2020 because it belongs to the 53rd week of 2020 (see the sketch just after this list).
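
A quick way to see this week-year behaviour with plain java.time, independently of Spark (a minimal sketch using the ISO week fields, which also start the week on Monday):

import java.time.LocalDate
import java.time.temporal.WeekFields

// 2021-01-03 is the Sunday that closes week 53 of week year 2020
val d = LocalDate.of(2021, 1, 3)
d.get(WeekFields.ISO.weekBasedYear())       // 2020
d.get(WeekFields.ISO.weekOfWeekBasedYear()) // 53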

Then you convert it to a user-defined function and register it with your Spark session:

import org.apache.spark.sql.functions.udf

val to_week_date = udf(toWeekDate)
spark.udf.register("to_week_date", to_week_date)

Finally, you can use your user-defined function when creating the new column:

import org.apache.spark.sql.functions.{col, lit}

val weekNumber = "53/2020"

stgDF
  .withColumn("year_week_number",lit(weekNumber))
  .withColumn("dsupp_trans_dt", to_week_date(col("year_week_number")))

Which gives you the following result:

+---+----------+----------------+--------------+
|id |weekNumber|year_week_number|dsupp_trans_dt|
+---+----------+----------------+--------------+
|100|53/2020   |53/2020         |2021-01-03    |
+---+----------+----------------+--------------+
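
Since the function is also registered on the session, the same column could equally be built from a SQL expression; a minimal sketch, reusing stgDF and weekNumber from above:

import org.apache.spark.sql.functions.{expr, lit}

stgDF
  .withColumn("year_week_number", lit(weekNumber))
  .withColumn("dsupp_trans_dt", expr("to_week_date(year_week_number)"))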

Upvotes: 1

Bondarenko

Reputation: 293

The documentation for the weekofyear Spark function has the answer:

Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601.

It means that every year actually has 52 full weeks plus n days, where n < 7. For that reason, to_date considers 53/2020 an invalid date and returns null. For the same reason, to_date considers 01/2020 an invalid date, because 01/2020 is actually the 53rd week of the year 2019.
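
You can inspect this ISO 8601 numbering directly with weekofyear; a minimal sketch run in spark-shell (the sample dates are just illustrative):

spark.sql("""
select d, weekofyear(d) as week
from (select explode(array('2020-12-27','2021-01-03')) as d) t
""").show(false)

Here 2020-12-27 reports week 52, and 2021-01-03 reports week 53 even though it falls in calendar year 2021, because it belongs to the 53rd ISO week of 2020.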

Upvotes: 0
