Lalit Lakhotia

Reputation: 265

How to get a date series from start and end dates in Spark SQL

I have to convert date tuples, each containing a start and an end date, into a date series.

+-----------------------------------------+
|dateRange                                |
+-----------------------------------------+
|[2017-04-06 00:00:00,2017-04-05 00:00:00]|
|[2017-04-05 00:00:00,2017-04-04 00:00:00]|
|[2017-04-04 00:00:00,2017-04-03 00:00:00]|
|[2017-04-03 00:00:00,2017-03-31 00:00:00]|
|[2017-03-31 00:00:00,2017-03-30 00:00:00]|
|[2017-03-30 00:00:00,2017-03-29 00:00:00]|
|[2017-03-29 00:00:00,2017-03-28 00:00:00]|
|[2017-03-28 00:00:00,2017-03-27 00:00:00]|
|[2017-04-06 00:00:00,2017-04-05 00:00:00]|
|[2017-04-05 00:00:00,2017-04-04 00:00:00]|
|[2017-04-04 00:00:00,2017-04-03 00:00:00]|
|[2017-04-03 00:00:00,2017-03-31 00:00:00]|
|[2017-03-31 00:00:00,2017-03-30 00:00:00]|
|[2017-03-30 00:00:00,2017-03-29 00:00:00]|
|[2017-03-29 00:00:00,2017-03-28 00:00:00]|
|[2017-03-28 00:00:00,2017-03-27 00:00:00]|
|[2017-04-06 00:00:00,2017-04-05 00:00:00]|
+-----------------------------------------+

How can I convert each of these tuples into the series of all dates between the two, inclusive? For example,

|[2017-04-03 00:00:00,2017-03-31 00:00:00]|  

after the transformation should become

|[2017-04-03 00:00:00,2017-04-02 00:00:00,2017-04-01 00:00:00,2017-03-31 00:00:00]|  

Upvotes: 1

Views: 2092

Answers (2)

koiralo

Reputation: 23109

Creating a UDF that calculates the dates between fromDate and toDate solves the problem. I have used the Joda-Time API for simplicity; you need to add its dependency as follows.

For SBT:

libraryDependencies += "joda-time" % "joda-time" % "2.8.1"

Below is an example for your problem:

import spark.implicits._
import org.apache.spark.sql.functions.udf
import org.joda.time.format.DateTimeFormat

val data = spark.sparkContext.parallelize(Seq(
  ("2017-04-03 00:00:00,2017-03-31 00:00:00"),
  ("2017-03-31 00:00:00,2017-03-30 00:00:00"),
  ("2017-03-30 00:00:00,2017-03-29 00:00:00"),
  ("2017-03-29 00:00:00,2017-03-28 00:00:00"),
  ("2017-03-28 00:00:00,2017-03-27 00:00:00"),
  ("2017-04-03 00:00:00,2017-03-31 00:00:00"),
  ("2017-04-06 00:00:00,2017-04-05 00:00:00")
)).toDF("dateRanges")

val calculateDate = udf((date: String) => {
  val dtf = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")

  // Split the "from,to" string and parse both timestamps.
  val from = dtf.parseDateTime(date.split(",")(0)).toDateTime()
  val to   = dtf.parseDateTime(date.split(",")(1)).toDateTime()

  val dates = scala.collection.mutable.MutableList[String]()
  var toDate = to
  // Walk one day at a time from `to` towards `from`, collecting each date.
  while (from.getMillis != toDate.getMillis) {
    dates += toDate.toString(dtf)
    if (from.getMillis > toDate.getMillis) toDate = toDate.plusDays(1)
    else toDate = toDate.minusDays(1)
  }
  // Include the `from` endpoint itself.
  dates += from.toString(dtf)
  dates
})

data.withColumn("newDate", calculateDate(data("dateRanges")))
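
To inspect the result (a quick check, not shown in the original post), call .show(false); for the first row, the newDate array should contain each day from 2017-03-31 00:00:00 up to 2017-04-03 00:00:00:

    data.withColumn("newDate", calculateDate(data("dateRanges"))).show(false)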

This works in both cases, whether your toDate is smaller or greater than your fromDate.
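
As an aside, not part of the original answer: on Spark 2.4 or later the built-in sequence function can produce such a series without a UDF, provided the two timestamps sit in separate columns (the column names below are hypothetical):

    // Spark 2.4+ only; `fromDate`/`toDate` are assumed column names.
    // The interval's sign must match the direction (use `interval 1 day`
    // when fromDate is before toDate).
    df.selectExpr("sequence(fromDate, toDate, interval -1 day) AS dateSeries")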

Hope this helps!

Upvotes: 1

Lalit Lakhotia

Reputation: 265

I have tried the below code snippet, and it is working for me.

  import org.apache.spark.sql.functions._
  import org.joda.time.LocalDate

  // Iterates one day at a time; note that takeWhile (_ isBefore end)
  // excludes the end date itself from the series.
  def dayIterator(start: LocalDate, end: LocalDate) = Iterator.iterate(start)(_ plusDays 1) takeWhile (_ isBefore end)

  def dateSeries(date1: String, date2: String): Array[String] = {
    // Keep only the "yyyy-MM-dd" part of each timestamp string.
    val fromDate = new LocalDate(date1.split(" ")(0))
    val toDate = new LocalDate(date2.split(" ")(0))
    val series = dayIterator(fromDate, toDate).toArray
    // Append the time-of-day suffix to every element.
    val arr = series.map(a => a.toString() + " 00:00:00.0")
    arr
  }

  val DateSeries = udf(dateSeries(_: String, _: String))
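
With the UDF defined, it can be applied to a DataFrame carrying the start and end timestamps in two string columns (the column names below are assumptions, not from the original post):

  // Hypothetical columns `startTs` and `endTs` holding timestamp strings.
  df.withColumn("dateSeries", DateSeries(col("startTs"), col("endTs")))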


scala> dateSeries("2017-03-31 00:00:00.0","2017-04-03 00:00:00.0")
res53: Array[String] = Array(2017-03-31, 2017-04-01, 2017-04-02)

Now I am not able to figure out why, even after appending " 00:00:00.0" in the map operation in the dateSeries method, the array it returns doesn't have that appended string.
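
As a sanity check (a hypothetical REPL session, not output from the original post), re-defining dateSeries exactly as above and calling it again should show the suffix, since the map runs on every element:

  scala> dateSeries("2017-03-31 00:00:00.0","2017-04-03 00:00:00.0")
  res54: Array[String] = Array(2017-03-31 00:00:00.0, 2017-04-01 00:00:00.0, 2017-04-02 00:00:00.0)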

Upvotes: 2
