0Ajax0
0Ajax0

Reputation: 37

Join and count distinct values between 2 dates

I have a left table:

CREATE TABLE tab (
  code int
, max  date
, min  date
);

And a right table with user activity (one line per user per day):

CREATE TABLE activity (
  date   date
, "user" text
);

Sample data:

Table tab (with the added column I am trying to compute):

Code  Min          Max            (Expected output)
201   2019-1-8    2019-1-10           3
202   2019-1-8    2019-1-11           3
203   2019-1-11   2019-1-12           2

Table activity:

Date              User
2019-1-8         ABCD001
2019-1-8         ABCD002
2019-1-9         ABCD001
2019-1-9         ABCD003
2019-1-10        ABCD001
2019-1-11        ABCD002
2019-1-12        ABCD003

I want as output the columns of tab, extended with the count of distinct users activity within that date range

So I need to join on the date range between min and max somehow.

Trial 1:

select code, min, max, count(b.distinct user)
from tab a
left join activity b on b.date between a.min and a.max
group by 1,2,3 

Trial 2:

select code, min, max, count(b.distinct user)
from tab a
left join activity b on b.date <= a.min and b.date >=a.max and b.date = a.min
group by 1,2,3

I ran different versions of the above code, but it either takes ages to run or fails.

Upvotes: 1

Views: 1026

Answers (1)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656734

I suggest a LEFT JOIN LATERAL:

SELECT a.code, a.min, a.max, b.ct
FROM   tab a
LEFT   JOIN LATERAL (
   SELECT count(DISTINCT b."user") AS ct
   FROM   activity b
   WHERE  b.date BETWEEN a.min AND a.max
   ) b ON true;

A query like this can be expensive if ranges are not trivially small. Each row on the left involves a range of rows on the right. A DISTINCT count is expensive on top of it, for big ranges. But this should be the fastest query possible.

Support it with an index on activity(date, "user"). Only add the 2nd index column if you get index-only scans out of it.

user is a reserved word, btw. Don't use it as identifier.
And I wouldn't use date, min or max either. Names of basic data types and functions as identifiers can lead to confusing errors.

Related:

Upvotes: 1

Related Questions