tscizzle
tscizzle

Reputation: 12251

Retrieve IDs with a minimum time gap between consecutive rows

I have the following event table in Postgres 9.3:

CREATE TABLE event (
  event_id    integer PRIMARY KEY,
  user_id     integer,
  event_type  varchar,
  event_time  timestamptz
);

My goal is to retrieve all user_id's with a gap of at least 30 days between any of their events (or between their last event and the current time). An additional complication is that I only want the users who have one of these gaps occur at a later time than them performing a certain event_type 'convert'. How can this be done easily?

Some example data in the event table might look like:

INSERT INTO event (event_id, user_id, event_type, event_time)
VALUES
(10, 1, 'signIn',  '2015-05-05 00:11'),
(11, 1, 'browse',  '2015-05-05 00:12'),  -- no 'convert' event

(20, 2, 'signIn',  '2015-06-07 02:35'),
(21, 2, 'browse',  '2015-06-07 02:35'),
(22, 2, 'convert', '2015-06-07 02:36'),  -- only 'convert' event
(23, 2, 'signIn',  '2015-08-10 11:00'),  -- gap of >= 30 days
(24, 2, 'signIn',  '2015-08-11 11:00'),

(30, 3, 'convert', '2015-08-07 02:36'),  -- starting with 1st 'convert' event
(31, 3, 'signIn',  '2015-08-07 02:36'),
(32, 3, 'convert', '2015-08-08 02:36'),
(33, 3, 'signIn',  '2015-08-12 11:00'),  -- all gaps below 30 days
(33, 3, 'browse',  '2015-08-12 11:00'),  -- gap until today (2015-08-20) too small

(40, 4, 'convert', '2015-05-07 02:36'),
(41, 4, 'signIn',  '2015-05-12 11:00');  -- gap until today (2015-08-20) >= 30 days

Expected result:

user_id
--------
2
4

Upvotes: 3

Views: 227

Answers (2)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 658042

One way to do it:

SELECT user_id
FROM  (
   SELECT user_id
        , lead(e.event_time, 1, now()) OVER (PARTITION BY e.user_id ORDER BY e.event_time)
          - event_time AS gap
   FROM  (  -- only users with 'convert' event
      SELECT user_id, min(event_time) AS first_time
      FROM   event
      WHERE  event_type = 'convert'
      GROUP  BY 1
      ) e1
   JOIN   event e USING (user_id)
   WHERE  e.event_time >= e1.first_time
   ) sub
WHERE  gap >= interval '30 days'
GROUP  BY 1;

The window function lead() allows to include a default value if there is no "next row", which is convenient to cover your additional requirement "or between their last event and the current time".

Indexes

You should at least have an index on (user_id, event_time) if your table is big:

CREATE INDEX event_user_time_idx ON event(user_id, event_time);

If you do that often and the event_type 'convert' is rare, add another partial index:

CREATE INDEX event_user_time_convert_idx ON event(user_id, event_time)
WHERE  event_type = 'convert';

For many events per user

And only if gaps of 30 days are common (not a rare case).
Indexes become even more important.
Try this recursive CTE for better performance:

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT DISTINCT ON (user_id)
          user_id, event_time, interval '0 days' AS gap
   FROM   event
   WHERE  event_type = 'convert'
   ORDER  BY user_id, event_time
   )

   UNION ALL
   SELECT c.user_id, e.event_time, COALESCE(e.event_time, now()) - c.event_time
   FROM   cte c
   LEFT   JOIN LATERAL (
      SELECT e.event_time
      FROM   event e
      WHERE  e.user_id = c.user_id
      AND    e.event_time > c.event_time
      ORDER  BY e.event_time
      LIMIT  1     -- the next later event
      ) e ON true  -- add 1 row after last to consider gap till "now"
   WHERE  c.event_time IS NOT NULL
   AND    c.gap < interval '30 days'
   )
SELECT * FROM cte
WHERE  gap >= interval '30 days';

It has considerably more overhead, but can stop - per user - at the first gap that's big enough. If that should be the gap between the last event now, then event_time in the result is NULL.

New SQL Fiddle with more revealing test data demonstrating both queries.

Detailed explanation in these related answers:

Upvotes: 2

Juan Carlos Oropeza
Juan Carlos Oropeza

Reputation: 48197

SQL Fiddle

This is another way, probably not as neat as @Erwin but have all the step separated so is easy to adapt.

  • include_today: add a dummy event to indicate current date.
  • event_convert: calculate the first time the event convert appear for each user_id (in this case only user_id = 2222)
  • event_row: asign an unique consecutive id to each event. starting from 1 for each user_id
  • last part join all together and using rnum = rnum + 1 so could calculate date difference.
  • also the result show both event involve in the 30 days range so you can see if that is the result you want.

.

WITH include_today as (
     (SELECT 'xxxx' event_id, user_id, 'today' event_type, current_date as event_time
     FROM users)
     UNION
     (SELECT *
     FROM event)
),
event_convert as (
     SELECT user_id, MIN(event_time) min_time
     FROM event
     WHERE event_type = 'convert'
     GROUP BY user_id
),
event_row as (
    SELECT *, row_number() OVER (PARTITION BY user_id ORDER BY event_time desc) as rnum
    FROM 
        include_today 
 )
SELECT 
    A.user_id,
    A.event_id eventA, 
    A.event_type typeA,
    A.event_time timeA,
    B.event_id eventB, 
    B.event_type typeB,
    B.event_time timeB,
    (B.event_time - A.event_time) days
FROM
    event_convert e 
    Inner Join event_row A 
        ON e.user_id = A.user_id and e.min_time <= a. event_time 
    Inner Join event_row B 
        ON A.rnum = B.rnum + 1
       AND A.user_id = B.user_id
WHERE 
   (B.event_time - A.event_time) > interval '30 days'
ORDER BY 1,4

Upvotes: 0

Related Questions