G G

Reputation: 1069

Apache Spark: transform time series data from one row per day into 24 × 1 hr rows

I have data aggregated as one row per day. I want to split each row into 24 one-hour rows.

Input
1 24

Output
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
1 10
 ...
1 24

Upvotes: 1

Views: 46

Answers (1)

marios

Reputation: 8996

Say your time series is in (day, value) pairs:

(1,10)
(2,5)
(3,4)
...

And you want to convert them into (hour, value) pairs, where the value stays the same for every hour of the same day:

(1,10)
(2,10)
(3,10)
...
(24,10)
(25,5)
...
(48,5)
(49,4)
...
(72,4)
...

Here is how to do this in basic Scala:

val timeSeries = Seq(1 -> 10, 2 -> 5, 3 -> 4)

// For each day, emit 24 (hour, value) pairs; day d covers hours (d-1)*24+1 to d*24
timeSeries.flatMap { case (day, value) =>
  (1 to 24).map(h => ((day - 1) * 24 + h, value))
}
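For the three sample pairs above, this produces 72 elements: hours 1-24 carry the value 10, hours 25-48 carry 5, and hours 49-72 carry 4, matching the (hour, value) listing shown earlier.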

Here is how to do this on Spark:

val rddTimeSeries = sc.makeRDD(timeSeries)

// Very similar to what we do in plain Scala
val perHourTs = rddTimeSeries.flatMap { case (day, value) =>
  (1 to 24).map(hour => ((day - 1) * 24 + hour, value))
}
// Printing is fine here because we know the result is small
println(perHourTs.collect().toList)

One complication with Spark is that the data may come back out of order, which can scramble your time series. The simplest way to address this is to sort the data before calling an action on the RDD.

// Here is how to sort your time series
perHourTs.sortBy(_._1).collect()
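If you prefer the DataFrame API over RDDs, the same per-day explosion can be expressed with explode. The following is a minimal sketch, assuming Spark 2.x and an existing SparkSession named spark; the column names day, value, h, and hour are illustrative, not part of the original question:

import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark`

val df = Seq((1, 10), (2, 5), (3, 4)).toDF("day", "value")

// Attach a literal array of the 24 hour offsets, explode it into one
// row per hour, and compute the global hour index as before.
val hours = array((1 to 24).map(lit): _*)
val perHourDf = df
  .withColumn("h", explode(hours))
  .select((($"day" - 1) * 24 + $"h").as("hour"), $"value")
  .orderBy("hour")

perHourDf.show(30)

The orderBy serves the same purpose as the sortBy above: it guarantees the rows come back in hour order.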

Upvotes: 2
