datapug

Reputation: 2431

How to get year and week number aligned for a date

While trying to get the year and week number for a range of dates spanning multiple years, I am running into some issues at the start/end of the year.

I understand the logic of the week number and of the year when they run separately. However, when they are combined, in some cases they don't produce consistent results, and I was wondering what the best way is in Spark to make sure those scenarios are handled with a consistent year for the given week number.

For example, running:

spark.sql("select year('2017-01-01') as year, weekofyear('2017-01-01') as weeknumber").show(false)

outputs:

+----+----------+
|year|weeknumber|
+----+----------+
|2017|52        |
+----+----------+

But the wanted output would be:

+----+----------+
|year|weeknumber|
+----+----------+
|2016|52        |
+----+----------+

and running:

spark.sql("select year('2018-12-31') as year, weekofyear('2018-12-31') as weeknumber").show(false)

produces:

+----+----------+
|year|weeknumber|
+----+----------+
|2018|1         |
+----+----------+

But what is expected is:

+----+----------+
|year|weeknumber|
+----+----------+
|2019|1         |
+----+----------+

Code is running on Spark 2.4.2.

Upvotes: 1

Views: 1813

Answers (3)

Puneeth Reddy V

Reputation: 1568

You can also use a UDF to achieve this:

import java.time.temporal.IsoFields

// Returns the ISO week-based year of the given date (None for a null input).
def weekYear(date: java.sql.Date): Option[Int] = {
  if (date == null) None
  else Some(date.toLocalDate.get(IsoFields.WEEK_BASED_YEAR))
}

Register this UDF as:

spark.udf.register("yearOfWeek", weekYear _)

Result:

scala> spark.sql("select yearOfWeek('2017-01-01') as year, WEEKOFYEAR('2017-01-01') as weeknumber").show(false)
+----+----------+
|year|weeknumber|
+----+----------+
|2016|52        |
+----+----------+
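
If you prefer the DataFrame API over a registered SQL UDF, here is a minimal sketch of wrapping the same weekYear function with udf() (it assumes a spark-shell session, i.e. a SparkSession named spark with spark.implicits._ available):

import org.apache.spark.sql.functions.{col, udf, weekofyear}
import spark.implicits._

// Wrap the weekYear function so it can be applied directly to a date column.
val yearOfWeekUdf = udf(weekYear _)

Seq("2017-01-01", "2018-12-31").toDF("dateval")
  .select(
    col("dateval"),
    yearOfWeekUdf(col("dateval").cast("date")).as("year"),
    weekofyear(col("dateval")).as("weeknumber"))
  .show(false)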

Upvotes: 0

suresiva

Reputation: 3173

You can adjust the year with a CASE expression whenever the week number and the month disagree (week 1 in December, or week 52 in January):

val df = Seq(("2017-01-01"), ("2018-12-31")).toDF("dateval")
df.show()

+----------+
|   dateval|
+----------+
|2017-01-01|
|2018-12-31|
+----------+

df.createOrReplaceTempView("date_tab")

val newDF = spark.sql("""select dateval, 
                    case when weekofyear(dateval)=1 and month(dateval)=12 then struct((year(dateval)+1) as yr, weekofyear(dateval) as wk)
                         when weekofyear(dateval)=52 and month(dateval)=1 then struct((year(dateval)-1) as yr, weekofyear(dateval) as wk)
                         else struct((year(dateval)) as yr, weekofyear(dateval) as wk) end as week_struct
             from date_tab""");

newDF.select($"dateval", $"week_struct.yr", $"week_struct.wk").show()

+----------+----+---+
|   dateval|  yr| wk|
+----------+----+---+
|2017-01-01|2016| 52|
|2018-12-31|2019|  1|
+----------+----+---+
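
For reference, the same CASE logic can also be expressed with the DataFrame API; a minimal sketch using when/otherwise on the df from above:

import org.apache.spark.sql.functions.{col, month, weekofyear, year, when}

// Same boundary adjustment as the SQL CASE above, expressed with when/otherwise.
// (If your data can contain January dates that fall in ISO week 53 of the previous
//  year, e.g. 2016-01-01, that case would need the same "-1" treatment as week 52.)
val adjusted = df
  .withColumn("wk", weekofyear(col("dateval")))
  .withColumn("yr",
    when(col("wk") === 1 && month(col("dateval")) === 12, year(col("dateval")) + 1)
      .when(col("wk") === 52 && month(col("dateval")) === 1, year(col("dateval")) - 1)
      .otherwise(year(col("dateval"))))

adjusted.select("dateval", "yr", "wk").show()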

Upvotes: 1

Shantanu Kher

Reputation: 1054

This Spark behavior is consistent with the ISO 8601 definition, and you cannot change it. However, there is a workaround I can think of.

You can first determine the dayOfWeek: if it is less than 4, increase the year by one; if it equals 4, keep the year untouched; otherwise, decrease the year by one.

Example with 2017-01-01

sql("select case when date_format('2017-01-01', 'u') < 4 then year('2017-01-01')+1 when date_format('2017-01-01', 'u') = 4 then year('2017-01-01') else year('2017-01-01')- 1 end as year, weekofyear('2017-01-01') as weeknumber, date_format('2017-01-01', 'u') as dayOfWeek").show(false)

+----+----------+---------+
|year|weeknumber|dayOfWeek|
+----+----------+---------+
|2016|52        |7        |
+----+----------+---------+

Example with 2018-12-31

sql("select case when date_format('2018-12-31', 'u') < 4 then year('2018-12-31')+1 when date_format('2018-12-31', 'u') = 4 then year('2018-12-31') else year('2018-12-31')- 1 end as year, weekofyear('2018-12-31') as weeknumber, date_format('2018-12-31', 'u') as dayOfWeek").show(false)

+----+----------+---------+
|year|weeknumber|dayOfWeek|
+----+----------+---------+
|2019|1         |1        |
+----+----------+---------+

Upvotes: 2
