Reputation: 3267

How to aggregate data by a dynamic time interval

with psql I want to count every unique visitor.

A unique visitor is a visitor who did not visit less than an hour previously.

So for the following rows of users and timestamps we'd get a total count of 4 unique visitors with user1 and user2 counting as 2 respectively.

Please note that I do not want to aggregate by hour in a 24 hour day. I want to aggregate by an hour after the time stamp of the users first visit.

I'm guessing a straight up sql expression won't do it.

user1,"2015-07-13 08:28:45.247000" 
user1,"2015-07-13 08:30:17.247000"
user1,"2015-07-13 09:35:00.030000" 
user1,"2015-07-13 09:54:00.652000"
user2,"2015-07-13 08:28:45.247000" 
user2,"2015-07-13 08:30:17.247000"
user2,"2015-07-13 09:35:00.030000" 
user2,"2015-07-13 09:54:00.652000"

So user1 arrives at 8:28, that counts as one hit. He comes back at 8:30 which counts as zero. He then comes back at 9:35 which is more than an hour from 8:30, so he gets another hit. Then he comes back at 9:35 which is only 5 minutes from the last time 9:30 so this counts as zero. The total is 2 hits for user1. The same thing happens for user2 meaning two hits each bringing it to a final total of 4.

Upvotes: 0

Answers (2)

Gordon Linoff

Reputation: 1271003

Here is one method:

select count(*)
from t
where not exists (select 1
                  from t t2
                  where t2.user = t.user and
                        t2.timestamp < t.timestamp and
                        t2.timestamp > t.timestamp - interval '1 hour'
                 );

EDIT:

If duplicate timestamps are a potential issue, you can use count(distinct user, timestamp).

Note: user and timestamp are both keywords and user is reserved. Hopefully your actual columns are named something else.

The where clause only keeps records where there is no other record for the user in the hour before. That is your definition of a "new" user, so aggregating the count should be what you are looking for.

Upvotes: 2

Erwin Brandstetter

Reputation: 658927

Corner cases in your definition aside (see comment), this query is many times faster than the one @Gordon provided in my local test on Postgres 9.4, with or without index:

SELECT count(ct)
FROM  (
   SELECT user_id, CASE WHEN lead(created_at, 1, 'infinity' )
                      OVER (PARTITION BY user_id ORDER BY created_at)
                 > created_at + interval '1h' THEN true ELSE NULL END AS ct
   FROM   tbl
   ) sub;

The core feature is the window function lead() looking up the "next" row for the current user. If the next row is more than an hour away, this row counts.

lead() has the often overlooked capability to provide a default for missing rows. Filling in 'infinity' to cover the corner case.

This query works for NULL values (not counted) or duplicate values (only 1 row counts) in the timestamp column created_at.

Key to top performance is a multicolumn index:

CREATE INDEX tbl(user_id, created_at);

For a bullet-proof definition of "unique visitor" different query styles might be superior.

Upvotes: 1

How to aggregate data by a dynamic time interval

Answers (2)

Related Questions