Spark time series query with two date columns

Question

I need to do some calculations based on historical data in Spark, but my case is a little different from the examples that float around the Internet. I have a dataset with three columns: enter_date, exit_date, and client_id. I need to count the clients that were online during each hourly interval.

For example, consider the following data:

enter_date             | exit_date               | client_id
2017-03-01 12:30:00    | 2017-03-01 13:30:00     | 1
2017-03-01 12:45:00    | 2017-03-01 14:10:00     | 2
2017-03-01 13:00:00    | 2017-03-01 15:20:00     | 3

I need to get the following result:

time_interval          | count
2017-03-01 12:00:00    | 2
2017-03-01 13:00:00    | 3
2017-03-01 14:00:00    | 2
2017-03-01 15:00:00    | 1

As you can see, the calculation must be based on both the enter_date and exit_date columns, not on enter_date alone: a client counts toward every hourly interval that overlaps its session.

So there are two main questions:

Is Spark able to do this type of calculation?
If yes, how?

pasha701 · Accepted Answer

In Scala it can be implemented as follows; I guess the Python version is similar:

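// Assumes an existing SparkSession named `spark`
import spark.implicits._

case class Client(enter_date: String, exit_date: String, client_id: Int)

val clientList = List(
  Client("2017-03-01 12:30:00", "2017-03-01 13:30:00", 1),
  Client("2017-03-01 12:45:00", "2017-03-01 14:10:00", 2),
  Client("2017-03-01 13:00:00", "2017-03-01 15:20:00", 3)
)

val clientDF = spark.sparkContext.parallelize(clientList).toDF
val timeFunctions = new TimeFunctions()

// reduceByKey and sortByKey are RDD operations, so switch to the underlying RDD first
val result = clientDF.rdd.flatMap(
  // return the list of hour starts between "enter_date" and "exit_date"
  row => timeFunctions.getDiapason(row.getAs[String]("enter_date"), row.getAs[String]("exit_date"))
).map(time => (time, 1)).reduceByKey(_ + _).sortByKey(ascending = true)

// collect() brings the sorted pairs to the driver so they print in order
result.collect().foreach(println)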

The result is the following:

(2017-03-01 12:00:00,2)
(2017-03-01 13:00:00,3)
(2017-03-01 14:00:00,2)
(2017-03-01 15:00:00,1)
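
To see where these counts come from, the same expand-and-count logic can be reproduced on plain Scala collections, without Spark. This is only a minimal sketch: the names HourlyCountsLocal, hourBuckets and countPerHour are made up for illustration and are not part of the code above, and groupBy stands in for reduceByKey.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Hypothetical helper object, for illustration only.
object HourlyCountsLocal {
  private val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Every hour start from the hour containing `from` through the hour containing `to`.
  def hourBuckets(from: String, to: String): Seq[String] = {
    val first = LocalDateTime.parse(from, formatter).truncatedTo(ChronoUnit.HOURS)
    val last = LocalDateTime.parse(to, formatter).truncatedTo(ChronoUnit.HOURS)
    Iterator
      .iterate(first)(t => t.plusHours(1))
      .takeWhile(t => !t.isAfter(last))
      .map(t => formatter.format(t))
      .toList
  }

  // Plain-collection stand-in for flatMap + reduceByKey + sortByKey.
  def countPerHour(sessions: Seq[(String, String)]): Seq[(String, Int)] =
    sessions
      .flatMap { case (enter, exit) => hourBuckets(enter, exit) }
      .groupBy(identity)
      .map { case (hour, hits) => hour -> hits.size }
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val sessions = Seq(
      ("2017-03-01 12:30:00", "2017-03-01 13:30:00"),
      ("2017-03-01 12:45:00", "2017-03-01 14:10:00"),
      ("2017-03-01 13:00:00", "2017-03-01 15:20:00")
    )
    countPerHour(sessions).foreach(println)
  }
}
```

Client 1 lands in the 12:00 and 13:00 buckets, client 2 in 12:00 through 14:00, and client 3 in 13:00 through 15:00, which adds up to the counts 2, 3, 2, 1 shown above.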

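TimeFunctions can be implemented like this:

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.collection.mutable.ArrayBuffer

// Serializable because the instance is captured by the flatMap closure
class TimeFunctions extends Serializable {
  // DateTimeFormatter is not Serializable, so it is rebuilt lazily after deserialization
  @transient private lazy val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Returns every hour start from the hour of `from` through the hour of `to`
  def getDiapason(from: String, to: String): Seq[String] = {
    var fromDate = LocalDateTime.parse(from, formatter).withSecond(0).withMinute(0)
    val result = ArrayBuffer(formatter.format(fromDate))

    val toDate = LocalDateTime.parse(to, formatter).withSecond(0).withMinute(0)
    while (toDate.compareTo(fromDate) > 0) {
      fromDate = fromDate.plusHours(1)
      result += formatter.format(fromDate)
    }
    result.toSeq
  }
}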
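
As a possible alternative that stays in the DataFrame API, Spark 2.4 and later ship a built-in sequence function that can generate the hourly buckets, so no custom class is needed. The sketch below is not the method used above. It assumes Spark 2.4+, string columns in the yyyy-MM-dd HH:mm:ss format, and a local SparkSession; HourlyCounts and hourlyCounts are hypothetical names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, date_trunc, explode, expr, sequence, to_timestamp}

// Hypothetical wrapper object, for illustration only.
object HourlyCounts {
  // Expands each row into one row per hour it overlaps, then counts rows per hour.
  def hourlyCounts(clients: DataFrame): DataFrame =
    clients
      .select(
        explode(
          sequence(
            date_trunc("hour", to_timestamp(col("enter_date"))),
            date_trunc("hour", to_timestamp(col("exit_date"))),
            expr("interval 1 hour")
          )
        ).as("time_interval")
      )
      .groupBy("time_interval")
      .count()
      .orderBy("time_interval")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder().master("local[*]").appName("hourly-counts").getOrCreate()
    import spark.implicits._

    val clients = Seq(
      ("2017-03-01 12:30:00", "2017-03-01 13:30:00", 1),
      ("2017-03-01 12:45:00", "2017-03-01 14:10:00", 2),
      ("2017-03-01 13:00:00", "2017-03-01 15:20:00", 3)
    ).toDF("enter_date", "exit_date", "client_id")

    hourlyCounts(clients).show(false)
    spark.stop()
  }
}
```

Keeping everything as column expressions avoids the round trip through RDDs and string keys, at the cost of requiring a newer Spark version.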
