I need to compute a moving average on a large time-series dataset using R on Spark.
I see there are implementations of this in Scala and Java (Moving Average in Spark Java, Apache Spark Moving Average), but nothing in R.
I managed to solve this using SparkR window functions. I'm using Spark 2.0 btw.
library(SparkR)
sparkR.session()

set.seed(123)
# Generate a Poisson sample (lambda = 15) so the result is easy to check
n <- 1000
orderingColumn <- seq(1, n)
data <- rpois(n, 15)
df <- data.frame(orderingColumn, data)

# Create a Spark DataFrame
sdf <- as.DataFrame(df)

# Moving average: order the window by the sequence column, then frame it
# over the current row plus the 100 preceding rows
ws <- windowOrderBy(sdf$orderingColumn)
frame <- rowsBetween(ws, -100, 0)
sdfWithMa <- withColumn(sdf, "moving_average", over(avg(sdf$data), frame))

head(sdfWithMa, 100)
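Since the series was generated locally, one quick sanity check (assuming n is small enough to collect back to the driver) is to compare the Spark result against a plain R rolling mean over the same frame:

local <- collect(sdfWithMa)
local <- local[order(local$orderingColumn), ]
# Same frame as rowsBetween(ws, -100, 0): each point averages the current
# value plus up to 100 preceding ones (fewer near the start of the series)
manual <- vapply(seq_len(n), function(i) mean(data[max(1, i - 100):i]), numeric(1))
all.equal(local$moving_average, manual)  # should return TRUE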
One thing to be aware of with the above: because the window is ordered but not partitioned, Spark moves all of the data into a single partition (it logs a warning to that effect), so this can be slow over large data sets, unfortunately. I wish the underlying implementation were different, though I understand that computing sliding windows over ordered data is hard on any system where the data is distributed.
If you are lucky enough that your moving average can be computed independently within partitions of the data, then you can change your window:
ws <- orderBy(windowPartitionBy("my_partition_column"), sdf$orderingColumn)
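For instance, if the data had a natural grouping column (here a hypothetical sensor_id), the full flow would look like this sketch; each group's window is computed independently, so Spark can parallelize across partitions instead of shuffling everything into one:

# Sketch, assuming a hypothetical grouping column "sensor_id" exists in sdf
ws <- orderBy(windowPartitionBy("sensor_id"), sdf$orderingColumn)
frame <- rowsBetween(ws, -100, 0)
sdfWithPartitionedMa <- withColumn(sdf, "moving_average", over(avg(sdf$data), frame))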