Reputation: 265
I have to convert a date tuple containing a start and an end date into a date series.
+-----------------------------------------+
|dateRange                                |
+-----------------------------------------+
|[2017-04-06 00:00:00,2017-04-05 00:00:00]|
|[2017-04-05 00:00:00,2017-04-04 00:00:00]|
|[2017-04-04 00:00:00,2017-04-03 00:00:00]|
|[2017-04-03 00:00:00,2017-03-31 00:00:00]|
|[2017-03-31 00:00:00,2017-03-30 00:00:00]|
|[2017-03-30 00:00:00,2017-03-29 00:00:00]|
|[2017-03-29 00:00:00,2017-03-28 00:00:00]|
|[2017-03-28 00:00:00,2017-03-27 00:00:00]|
|[2017-04-06 00:00:00,2017-04-05 00:00:00]|
|[2017-04-05 00:00:00,2017-04-04 00:00:00]|
|[2017-04-04 00:00:00,2017-04-03 00:00:00]|
|[2017-04-03 00:00:00,2017-03-31 00:00:00]|
|[2017-03-31 00:00:00,2017-03-30 00:00:00]|
|[2017-03-30 00:00:00,2017-03-29 00:00:00]|
|[2017-03-29 00:00:00,2017-03-28 00:00:00]|
|[2017-03-28 00:00:00,2017-03-27 00:00:00]|
|[2017-04-06 00:00:00,2017-04-05 00:00:00]|
+-----------------------------------------+
How can I convert each tuple into a series of all dates between the 'to' and 'from' dates? For example,
|[2017-04-03 00:00:00,2017-03-31 00:00:00]|
after the transformation should become
|[2017-04-03 00:00:00,2017-04-02 00:00:00,2017-04-01 00:00:00,2017-03-31 00:00:00]|
Upvotes: 1
Views: 2092
Reputation: 23109
Creating a UDF that calculates the dates between fromDate and toDate could solve the problem. I have used the Joda-Time API for simplicity; you need to add its dependency as follows.
For SBT:
libraryDependencies += "joda-time" % "joda-time" % "2.8.1"
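If you build with Maven instead, the same artifact can be added as:
<dependency>
    <groupId>joda-time</groupId>
    <artifactId>joda-time</artifactId>
    <version>2.8.1</version>
</dependency>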
Below is an example for your problem:
import org.apache.spark.sql.functions.udf
import org.joda.time.format.DateTimeFormat
import spark.implicits._

val data = spark.sparkContext.parallelize(Seq(
  ("2017-04-03 00:00:00,2017-03-31 00:00:00"),
  ("2017-03-31 00:00:00,2017-03-30 00:00:00"),
  ("2017-03-30 00:00:00,2017-03-29 00:00:00"),
  ("2017-03-29 00:00:00,2017-03-28 00:00:00"),
  ("2017-03-28 00:00:00,2017-03-27 00:00:00"),
  ("2017-04-03 00:00:00,2017-03-31 00:00:00"),
  ("2017-04-06 00:00:00,2017-04-05 00:00:00")
)).toDF("dateRanges")

val calculateDate = udf((date: String) => {
  val dtf = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")
  val from = dtf.parseDateTime(date.split(",")(0))
  val to = dtf.parseDateTime(date.split(",")(1))
  val dates = scala.collection.mutable.MutableList[String]()
  // Walk from 'from' towards 'to' one day at a time, collecting each date;
  // both endpoints end up in the result.
  var current = from
  while (current.getMillis != to.getMillis) {
    dates += current.toString(dtf)
    current = if (current.getMillis > to.getMillis) current.minusDays(1)
              else current.plusDays(1)
  }
  dates += to.toString(dtf)
  dates.toList
})

data.withColumn("newDate", calculateDate(data("dateRanges")))
This works in both cases, whether your toDate is smaller or greater than fromDate.
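If you would rather not pull in Joda-Time at all, here is a minimal sketch of the same UDF idea using java.time (assuming Java 8+ on the cluster; calculateDateJdk is just an illustrative name, and the formatter is built inside the UDF so the closure stays serializable):

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.udf

val calculateDateJdk = udf { (range: String) =>
  val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  val Array(fromStr, toStr) = range.split(",")
  val from = LocalDateTime.parse(fromStr, fmt)
  val to = LocalDateTime.parse(toStr, fmt)
  // Step forwards or backwards depending on which endpoint is later
  val step = if (from.isAfter(to)) -1L else 1L
  Iterator.iterate(from)(_.plusDays(step))
    .takeWhile(d => if (step < 0) !d.isBefore(to) else !d.isAfter(to))
    .map(_.format(fmt))
    .toList
}

data.withColumn("newDate", calculateDateJdk(data("dateRanges")))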
Hope this helps!
Upvotes: 1
Reputation: 265
I have tried the below code snippet and it's working for me.
import org.apache.spark.sql.functions._
import org.joda.time.LocalDate

def dayIterator(start: LocalDate, end: LocalDate) =
  Iterator.iterate(start)(_ plusDays 1) takeWhile (_ isBefore end)

def dateSeries(date1: String, date2: String): Array[String] = {
  val fromDate = new LocalDate(date1.split(" ")(0))
  val toDate = new LocalDate(date2.split(" ")(0))
  val series = dayIterator(fromDate, toDate).toArray
  series.map(a => a.toString() + " 00:00:00.0")
}

val DateSeries = udf(dateSeries(_: String, _: String))
scala> dateSeries("2017-03-31 00:00:00.0","2017-04-03 00:00:00.0")
res53: Array[String] = Array(2017-03-31, 2017-04-01, 2017-04-02)
Now I am not able to figure out why, even after appending " 00:00:00.0" in the map operation in the dateSeries method, the array it returns doesn't have that appended string.
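As a side note, dayIterator's takeWhile (_ isBefore end) stops before the end date, which is why 2017-04-03 is missing from res53 above. If the series should include both endpoints, as in the expected output in the question, one possible tweak is:

def dayIterator(start: LocalDate, end: LocalDate) =
  Iterator.iterate(start)(_ plusDays 1) takeWhile (d => !d.isAfter(end))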
Upvotes: 2