Lalit Lakhotia

Reputation: 265

How to get a date series from start and end dates in Spark SQL

I have to convert date tuples, each containing a start and an end date, into a date series.

+-----------------------------------------+
|dateRange                                |
+-----------------------------------------+
|[2017-04-06 00:00:00,2017-04-05 00:00:00]|
|[2017-04-05 00:00:00,2017-04-04 00:00:00]|
|[2017-04-04 00:00:00,2017-04-03 00:00:00]|
|[2017-04-03 00:00:00,2017-03-31 00:00:00]|
|[2017-03-31 00:00:00,2017-03-30 00:00:00]|
|[2017-03-30 00:00:00,2017-03-29 00:00:00]|
|[2017-03-29 00:00:00,2017-03-28 00:00:00]|
|[2017-03-28 00:00:00,2017-03-27 00:00:00]|
|[2017-04-06 00:00:00,2017-04-05 00:00:00]|
|[2017-04-05 00:00:00,2017-04-04 00:00:00]|
|[2017-04-04 00:00:00,2017-04-03 00:00:00]|
|[2017-04-03 00:00:00,2017-03-31 00:00:00]|
|[2017-03-31 00:00:00,2017-03-30 00:00:00]|
|[2017-03-30 00:00:00,2017-03-29 00:00:00]|
|[2017-03-29 00:00:00,2017-03-28 00:00:00]|
|[2017-03-28 00:00:00,2017-03-27 00:00:00]|
|[2017-04-06 00:00:00,2017-04-05 00:00:00]|
+-----------------------------------------+

How can I convert each of these tuples into the series of all dates between the two, inclusive? For example,

|[2017-04-03 00:00:00,2017-03-31 00:00:00]|  

after the transformation should become

|[2017-04-03 00:00:00,2017-04-02 00:00:00,2017-04-01 00:00:00,2017-03-31 00:00:00]|  

Upvotes: 1

Views: 2092

Answers (2)

koiralo

Reputation: 23109

Creating a UDF that calculates the dates between fromDate and toDate solves the problem. I have used the Joda-Time API for simplicity; you need to add its dependency as follows.

For SBT:

libraryDependencies += "joda-time" % "joda-time" % "2.8.1"

Below is an example for your problem:

import spark.implicits._
import org.apache.spark.sql.functions.udf
import org.joda.time.format.DateTimeFormat

val data = spark.sparkContext.parallelize(Seq(
  ("2017-04-03 00:00:00,2017-03-31 00:00:00"),
  ("2017-03-31 00:00:00,2017-03-30 00:00:00"),
  ("2017-03-30 00:00:00,2017-03-29 00:00:00"),
  ("2017-03-29 00:00:00,2017-03-28 00:00:00"),
  ("2017-03-28 00:00:00,2017-03-27 00:00:00"),
  ("2017-04-03 00:00:00,2017-03-31 00:00:00"),
  ("2017-04-06 00:00:00,2017-04-05 00:00:00")
)).toDF("dateRanges")

val calculateDate = udf((date: String) => {
  val dtf = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")

  // Split the "from,to" string and parse both timestamps.
  val from = dtf.parseDateTime(date.split(",")(0)).toDateTime()
  val to   = dtf.parseDateTime(date.split(",")(1)).toDateTime()

  val dates = scala.collection.mutable.MutableList[String]()
  var toDate = to
  // Walk one day at a time from `to` towards `from`, collecting each date.
  while (from.getMillis != toDate.getMillis) {
    dates += toDate.toString(dtf)
    if (from.getMillis > toDate.getMillis) toDate = toDate.plusDays(1)
    else toDate = toDate.minusDays(1)
  }
  // Include the `from` endpoint itself.
  dates += from.toString(dtf)
  dates
})

data.withColumn("newDate", calculateDate(data("dateRanges")))
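
To inspect the result (a quick check, not shown in the original post), call .show(false); for the first row, the newDate array should contain each day from 2017-03-31 00:00:00 up to 2017-04-03 00:00:00:

    data.withColumn("newDate", calculateDate(data("dateRanges"))).show(false)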

This works in both cases, whether your toDate is smaller or greater than your fromDate.
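
As an aside, not part of the original answer: on Spark 2.4 or later the built-in sequence function can produce such a series without a UDF, provided the two timestamps sit in separate columns (the column names below are hypothetical):

    // Spark 2.4+ only; `fromDate`/`toDate` are assumed column names.
    // The interval's sign must match the direction (use `interval 1 day`
    // when fromDate is before toDate).
    df.selectExpr("sequence(fromDate, toDate, interval -1 day) AS dateSeries")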

Hope this helps!

Upvotes: 1

Lalit Lakhotia

Reputation: 265

I have tried the below code snippet, and it is working for me.

  import org.apache.spark.sql.functions._
  import org.joda.time.LocalDate

  // Iterates one day at a time; note that takeWhile (_ isBefore end)
  // excludes the end date itself from the series.
  def dayIterator(start: LocalDate, end: LocalDate) = Iterator.iterate(start)(_ plusDays 1) takeWhile (_ isBefore end)

  def dateSeries(date1: String, date2: String): Array[String] = {
    // Keep only the "yyyy-MM-dd" part of each timestamp string.
    val fromDate = new LocalDate(date1.split(" ")(0))
    val toDate = new LocalDate(date2.split(" ")(0))
    val series = dayIterator(fromDate, toDate).toArray
    // Append the time-of-day suffix to every element.
    val arr = series.map(a => a.toString() + " 00:00:00.0")
    arr
  }

  val DateSeries = udf(dateSeries(_: String, _: String))
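
With the UDF defined, it can be applied to a DataFrame carrying the start and end timestamps in two string columns (the column names below are assumptions, not from the original post):

  // Hypothetical columns `startTs` and `endTs` holding timestamp strings.
  df.withColumn("dateSeries", DateSeries(col("startTs"), col("endTs")))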


scala> dateSeries("2017-03-31 00:00:00.0","2017-04-03 00:00:00.0")
res53: Array[String] = Array(2017-03-31, 2017-04-01, 2017-04-02)

Now I am not able to figure out why, even after appending " 00:00:00.0" in the map operation in the dateSeries method, the array it returns doesn't have that appended string.
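
As a sanity check (a hypothetical REPL session, not output from the original post), re-defining dateSeries exactly as above and calling it again should show the suffix, since the map runs on every element:

  scala> dateSeries("2017-03-31 00:00:00.0","2017-04-03 00:00:00.0")
  res54: Array[String] = Array(2017-03-31 00:00:00.0, 2017-04-01 00:00:00.0, 2017-04-02 00:00:00.0)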

Upvotes: 2
