Reputation: 661
I am pretty new to Spark and would like some advice on how to approach the following problem.
I have candle data (high, low, open, close) for every minute of a trading day, spread across a year. This represents about 360,000 data points.
What I want to do is run simulations across that data (possibly at every data point): for a given data point, take the previous (or next) x data points and run some code across them to produce a result.
Ideally this would be a map-style function, but you cannot nest distributed operations in Spark. The only approaches I can think of are to build a Dataset keyed by Candle with the related data denormalised onto each row, or to partition on every key; either way seems inefficient.
Ideally I am looking for something with the shape (Candle, List) -> Double, or similar.
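As a rough sketch of the shape I'm after (Candle here is just an illustrative case class, not something I have settled on):

```scala
case class Candle(ts: Long, open: Double, high: Double, low: Double, close: Double)

// For each candle, take a window of surrounding candles and reduce it to a score.
def simulate(current: Candle, window: List[Candle]): Double = ???
```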
I am sure there is a better approach.
I am using Spark 2.1.0 and using Yarn as the scheduling engine.
Upvotes: 1
Views: 227
Reputation: 2345
I've done a fair bit of time series processing in Spark, and have spent some time thinking about exactly the same problem.
Unfortunately, in my opinion, there is no nice way to process all of the data in the way you want without structuring it as you suggested. I think we just have to accept that this kind of thing is an expensive operation, whether we are using Spark, pandas or Postgres.
You may hide the code complexity by using Spark SQL window functions (look at rangeBetween / RANGE BETWEEN), but the essence of what you are doing cannot be escaped.
Protip: map the data to features->label once and write it to disk to make dev/testing faster!
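That is, something like this (path is illustrative), so you pay for the windowing once and iterate on the cheap modelling part afterwards:

```scala
scored
  .select($"ts", $"closes", $"score")
  .write.mode("overwrite")
  .parquet("/data/candle-features")
```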
Upvotes: 1