Reputation: 2431
While trying to get the year and week number for a range of dates spanning multiple years, I am running into issues at the start/end of the year. I understand the logic of weekofyear and of year when they run separately. However, when they are combined, in some cases they don't produce consistent results, and I was wondering what the best way in Spark is to make sure those scenarios are handled with a consistent year for the given week number.
For example, running:
spark.sql("select year('2017-01-01') as year, weekofyear('2017-01-01') as weeknumber").show(false)
outputs:
+----+----------+
|year|weeknumber|
+----+----------+
|2017|52        |
+----+----------+
But the desired output would be:
+----+----------+
|year|weeknumber|
+----+----------+
|2016|52        |
+----+----------+
and running:
spark.sql("select year('2018-12-31') as year, weekofyear('2018-12-31') as weeknumber").show(false)
produces:
+----+----------+
|year|weeknumber|
+----+----------+
|2018|1         |
+----+----------+
But the expected output is:
+----+----------+
|year|weeknumber|
+----+----------+
|2019|1         |
+----+----------+
Code is running on Spark 2.4.2.
Upvotes: 1
Views: 1813
Reputation: 1568
You can also use a UDF to achieve this:
import java.time.temporal.IsoFields

// ISO 8601 week-based year; None for null input so Spark gets a null back
def weekYear(date: java.sql.Date): Option[Int] = {
  if (date == null) None
  else Some(date.toLocalDate.get(IsoFields.WEEK_BASED_YEAR))
}
Register this UDF so it can be called from SQL:
spark.udf.register("yearOfWeek", weekYear _)
Result:
scala> spark.sql("select yearOfWeek('2017-01-01') as year, WEEKOFYEAR('2017-01-01') as weeknumber").show(false)
+----+----------+
|year|weeknumber|
+----+----------+
|2016|52        |
+----+----------+
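The same function also works with the DataFrame API by wrapping it in udf (a minimal sketch; it assumes spark.implicits._ is in scope, as in spark-shell, and dateval is just an illustrative column name):
import org.apache.spark.sql.functions.{udf, weekofyear}

val weekYearUdf = udf(weekYear _)

Seq("2017-01-01", "2018-12-31").toDF("dateval")
  .select(
    $"dateval",
    weekYearUdf($"dateval".cast("date")).as("year"),   // ISO week-based year
    weekofyear($"dateval").as("weeknumber"))
  .show(false)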
Upvotes: 0
Reputation: 3173
Another option is to fix up the year with a CASE expression and return both the year and the week as a struct. Note that a January date can also fall into ISO week 53 of the previous year (for example, 2016-01-01), so the comparison below uses >=52:
val df = Seq(("2017-01-01"), ("2018-12-31")).toDF("dateval")
df.show()
+----------+
|   dateval|
+----------+
|2017-01-01|
|2018-12-31|
+----------+
df.createOrReplaceTempView("date_tab")
val newDF = spark.sql("""select dateval,
       case when weekofyear(dateval)=1 and month(dateval)=12 then struct((year(dateval)+1) as yr, weekofyear(dateval) as wk)
            when weekofyear(dateval)>=52 and month(dateval)=1 then struct((year(dateval)-1) as yr, weekofyear(dateval) as wk)
            else struct(year(dateval) as yr, weekofyear(dateval) as wk) end as week_struct
       from date_tab""")
newDF.select($"dateval", $"week_struct.yr", $"week_struct.wk").show()
+----------+----+---+
|   dateval|  yr| wk|
+----------+----+---+
|2017-01-01|2016| 52|
|2018-12-31|2019|  1|
+----------+----+---+
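If you'd rather stay in the DataFrame API, the same CASE logic can be written with when/otherwise (a sketch; adjusted is just an illustrative name):
import org.apache.spark.sql.functions.{month, weekofyear, when, year}

val adjusted = df
  .withColumn("yr",
    when(weekofyear($"dateval") === 1 && month($"dateval") === 12, year($"dateval") + 1)
      .when(weekofyear($"dateval") >= 52 && month($"dateval") === 1, year($"dateval") - 1)
      .otherwise(year($"dateval")))          // same three branches as the SQL above
  .withColumn("wk", weekofyear($"dateval"))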
Upvotes: 1
Reputation: 1054
This Spark behavior is consistent with the ISO 8601 definition; you cannot change it. However, there is a workaround I could think of. For the boundary dates where year() and weekofyear() disagree, first determine dayOfWeek: if it is less than 4, increase the year by one; if it equals 4, keep the year untouched; otherwise, decrease the year by one.
Example with 2017-01-01
sql("select case when date_format('2017-01-01', 'u') < 4 then year('2017-01-01')+1 when date_format('2017-01-01', 'u') = 4 then year('2017-01-01') else year('2017-01-01')- 1 end as year, weekofyear('2017-01-01') as weeknumber, date_format('2017-01-01', 'u') as dayOfWeek").show(false)
+----+----------+---------+
|year|weeknumber|dayOfWeek|
+----+----------+---------+
|2016|52        |7        |
+----+----------+---------+
Example with 2018-12-31
sql("select case when date_format('2018-12-31', 'u') < 4 then year('2018-12-31')+1 when date_format('2018-12-31', 'u') = 4 then year('2018-12-31') else year('2018-12-31')- 1 end as year, weekofyear('2018-12-31') as weeknumber, date_format('2018-12-31', 'u') as dayOfWeek").show(false)
+----+----------+---------+
|year|weeknumber|dayOfWeek|
+----+----------+---------+
|2019|1         |1        |
+----+----------+---------+
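Note that this day-of-week rule is only safe on those boundary dates; applied to an arbitrary date it over-adjusts (for example, 2020-01-01 is a Wednesday, so dayOfWeek = 3, yet it already lies in week 1 of 2020). A general formulation, sketched below, relies on a standard ISO 8601 property: the week-based year of a date is the calendar year of the Thursday in the same week, which date_add can reach from any date:
sql("select year(date_add('2017-06-14', 4 - cast(date_format('2017-06-14', 'u') as int))) as year, weekofyear('2017-06-14') as weeknumber").show(false)
+----+----------+
|year|weeknumber|
+----+----------+
|2017|24        |
+----+----------+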
Upvotes: 2