Reputation: 21
I have a collection which holds documents, with each document having a data observation and the time that the data was captured. e.g. { _key:...., "data":26, "timecaptured":1643488638.946702 }
where timecaptured for now is a utc timestamp.
What I want to do is get the duration between consecutive observations, with SQL I could do this with LAG for example, but with ArangoDB and AQL I am struggling to see how to do this at the database. So effectively the difference in timestamps between two documents in time order. I have a lot of data and I don't really want to pull it all into pandas.
Any help really appreciated.
Upvotes: 2
Views: 191
Reputation: 2949
Although the solution provided by CodeManX works, I prefer a different one:
FOR d IN docs
SORT d.timecaptured
WINDOW { preceding: 1 } AGGREGATE s = SUM(d.timecaptured), cnt = COUNT(1)
LET timediff = cnt == 1 ? null : d.timecaptured - (s - d.timecaptured)
RETURN timediff
We simply calculate the sum of the previous and the current document, and by subtracting the current document's timecaptured
we can therefore calculate the timecaptured
of the previous document. So now we can easily calculate the requested difference.
I only use the COUNT
to return null
for the first document (which has no predecessor). If you are fine with having a difference of zero for the first document, you can simply remove it.
However, neither approach is very straight forward or obvious. I put on my TODO list to add an APPEND
aggregate function that could be used in WINDOW and COLLECT operations.
Upvotes: 2
Reputation: 11885
The WINDOW
function doesn't give you direct access to the data in the sliding window but here is a rather clever workaround:
FOR doc IN collection
SORT doc.timecaptured
WINDOW { preceding: 1 }
AGGREGATE d = UNIQUE(KEEP(doc, "_key", "timecaptured"))
LET timediff = doc.timecaptured - d[0].timecaptured
RETURN MERGE(doc, {timediff})
The UNIQUE()
function is available for window aggregations and can be used to get at the desired data (previous document). Aggregating full documents might be inefficient, so a projection should do, but remember that UNIQUE()
will remove duplicate values. A document _key
is unique within a collection, so we can add it to the projection to make sure that UNIQUE()
doesn't remove anything.
The time difference is calculated by subtracting the previous' documents timecaptured
value from the current document's one. In the case of the first record, d[0]
is actually equal to the current document and the difference ends up being 0
, which I think is sensible. You could also write d[-1].timecaptured - d[0].timecaptured
to achieve the same. d[1].timecaptured - d[0].timecaptured
on the other hand will give you the inverted timestamp for the first record because d[1]
is null
(no previous document) and evaluates to 0
.
There is one risk: UNIQUE()
may alter the order of the documents. You could use a subquery to sort by timecaptured
again:
LET timediff = doc.timecaptured - (
FOR dd IN d SORT dd.timecaptured LIMIT 1 RETURN dd.timecaptured
)[0]
But it's not great for performance to use a subquery. Instead, you can use the aggregation variable d
to access both documents and calculate the absolute value of the subtraction so that the order doesn't matter:
LET timediff = ABS(d[-1].timecaptured - d[0].timecaptured)
Upvotes: 1