Reputation: 21

Timeseries differencing - ArangoDB (AQL or Python)

I have a collection which holds documents, with each document having a data observation and the time that the data was captured. e.g. { _key:...., "data":26, "timecaptured":1643488638.946702 }

where timecaptured for now is a utc timestamp.

What I want to do is get the duration between consecutive observations, with SQL I could do this with LAG for example, but with ArangoDB and AQL I am struggling to see how to do this at the database. So effectively the difference in timestamps between two documents in time order. I have a lot of data and I don't really want to pull it all into pandas.

Any help really appreciated.

Upvotes: 2

Answers (2)

mpoeter

Reputation: 2949

Although the solution provided by CodeManX works, I prefer a different one:

FOR d IN docs
  SORT d.timecaptured
  WINDOW { preceding: 1 } AGGREGATE s = SUM(d.timecaptured), cnt = COUNT(1)
  LET timediff = cnt == 1 ? null : d.timecaptured - (s - d.timecaptured)
  RETURN timediff

We simply calculate the sum of the previous and the current document, and by subtracting the current document's timecaptured we can therefore calculate the timecaptured of the previous document. So now we can easily calculate the requested difference.

I only use the COUNT to return null for the first document (which has no predecessor). If you are fine with having a difference of zero for the first document, you can simply remove it.

However, neither approach is very straight forward or obvious. I put on my TODO list to add an APPEND aggregate function that could be used in WINDOW and COLLECT operations.

Upvotes: 2

CodeManX

Reputation: 11885

The WINDOW function doesn't give you direct access to the data in the sliding window but here is a rather clever workaround:

FOR doc IN collection
  SORT doc.timecaptured
  WINDOW { preceding: 1 }
  AGGREGATE d = UNIQUE(KEEP(doc, "_key", "timecaptured"))
  LET timediff = doc.timecaptured - d[0].timecaptured
  RETURN MERGE(doc, {timediff})

The UNIQUE() function is available for window aggregations and can be used to get at the desired data (previous document). Aggregating full documents might be inefficient, so a projection should do, but remember that UNIQUE() will remove duplicate values. A document _key is unique within a collection, so we can add it to the projection to make sure that UNIQUE() doesn't remove anything.

The time difference is calculated by subtracting the previous' documents timecaptured value from the current document's one. In the case of the first record, d[0] is actually equal to the current document and the difference ends up being 0, which I think is sensible. You could also write d[-1].timecaptured - d[0].timecaptured to achieve the same. d[1].timecaptured - d[0].timecaptured on the other hand will give you the inverted timestamp for the first record because d[1] is null (no previous document) and evaluates to 0.

There is one risk: UNIQUE() may alter the order of the documents. You could use a subquery to sort by timecaptured again:

LET timediff = doc.timecaptured - (
  FOR dd IN d SORT dd.timecaptured LIMIT 1 RETURN dd.timecaptured
)[0]

But it's not great for performance to use a subquery. Instead, you can use the aggregation variable d to access both documents and calculate the absolute value of the subtraction so that the order doesn't matter:

LET timediff = ABS(d[-1].timecaptured - d[0].timecaptured)

Upvotes: 1

Timeseries differencing - ArangoDB (AQL or Python)

Answers (2)

Related Questions