Daniel.Sj
Daniel.Sj

Reputation: 45

Identify groups of rows in close proximity

I am doing a project for school and have been given a database over gps-recordings for three people during the course of a week. I am trying to group these recordings into trips based on the time between them. If a recording is within 300 seconds from the recording before it, they are considered to be part of the same trip, otherwise, they are considered part of different trips.

So far I have managed to calculate the time difference between a recording on row n and the one on row n-1 and I am now trying to create a function for merging the recordings intro trips. This would have been real easy in another programming language, but in this course we are using PostgreSQL which I am not that well versed in.

To solve this, I am trying to create a function with a variable that increases every time the time difference between two recordings is greater than 300 seconds and assigns each row to a trip based on the variable. This is as far as I have currently gotten, although at the moment, the variable resets X all the time, thus assigning all rows to trip 1...

CREATE OR REPLACE FUNCTION tripmerge(time_diff double precision)
RETURNs integer AS $$
declare 
X integer := 1;
ID integer;
BEGIN
  IF time_diff < 300 THEN
    ID = X;
  ELSE 
    ID =X;
    X:=X+1;
  END IF;
 RETURN ID;
END;$$ 
LANGUAGE plpgsql;

How do I change so X does not reset all the time? I am using PostgreSQL 9.1.

EDIT:

This is the table I am working with:

curr_rec (timestamp), prev_rec (timestamp), time_diff (double precision)

With this being a sample of the dataset:

'2013-11-14 05:22:33.991+01',null ,null
'2013-11-14 09:15:40.485+01','2013-11-14 05:22:33.991+01',13986.494
'2013-11-14 09:17:04.837+01','2013-11-14 09:15:40.485+01',84.352
'2013-11-14 09:17:43.055+01','2013-11-14 09:17:04.837+01',38.218
'2013-11-14 09:23:24.205+01','2013-11-14 09:17:43.055+01',341.15

The expected result would add a column:

tripID
1
2
2
2
3

And I think this fiddle should be working: http://sqlfiddle.com/#!1/4e3e5/1/0

Upvotes: 1

Views: 435

Answers (1)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656784

This query uses only curr_rec, not the other redundant, precomputed columns:

SELECT 1 + count(step OR NULL) OVER (ORDER BY curr_rec) AS trip_id
FROM  (
   SELECT curr_rec
         ,lag(curr_rec) OVER (ORDER BY curr_rec) AS prev_rec
         ,curr_rec - lag(curr_rec) OVER (ORDER BY curr_rec)
                                        > interval '5 min' AS step
   FROM  timestamps
   ) x;

Key features are:

  • The window function lag(), which I use to see if the previous row is more than 5 minutes ago. (Just using an interval for the comparison, no need to extract seconds)
  • The window aggregate function count() - that's just the basic aggregate function with an OVER clause.
    The expression step OR NULL only leaves TRUE or NULL, where only TRUE is counted in a running count, thereby arriving at your desired result.

SQL Fiddle (building on the one you provided).

Upvotes: 3

Related Questions