Ranyk
Ranyk

Reputation: 267

SQL: how to select unique users from every timestamp over every month

I have a table in SQL with 1000 users, with top 10 entries:

table1:

+-------------+---------------+-------------------------+
| User_id     | Mail_id       | Reg_date(y-m-d)         |
+-------------+---------------+-------------------------+
| user1       | email1        | 2019-11-09 12:23:53.253 |
| user1       | email1        | 2019-11-09 12:24:53.253 |
| user1       | email1        | 2019-11-09 13:20:53.253 |
| user1       | email1        | 2019-08-09 11:23:53.253 |
| user2       | email2        | 2019-09-08 10:29:53.253 |
| user3       | email3        | 2019-09-08 14:23:53.253 |
| user1       | email1        | 2019-12-09 13:20:53.253 |
| user1       | email1        | 2019-10-10 11:23:53.253 |
| user1       | email1        | 2019-10-13 10:29:53.253 |
| user2       | email5        | 2019-11-14 10:29:53.253 |
+-------------+---------------+-------------------------+

table2:

+-------------+---------------+-------------------------+
| User_id     | Session_id    | Activity_date(y-m-d)    |
+-------------+---------------+-------------------------+
| user1       | s1            | 2019-11-09 12:23:53.253 |
| user1       | s2            | 2019-12-09 12:24:53.253 |
| user1       | s3            | 2019-12-09 13:20:53.253 |
| user1       | s4            | 2020-01-09 11:23:53.253 |
| user2       | s5            | 2019-12-08 10:29:53.253 |
| user3       | s6            | 2020-02-08 14:23:53.253 |
| user1       | s7            | 2019-12-09 13:20:53.253 |
| user1       | s8            | 2020-03-10 11:23:53.253 |
| user1       | s9            | 2020-02-13 10:29:53.253 |
| user2       | s10           | 2020-03-14 10:29:53.253 |
+-------------+---------------+-------------------------+

I want to select the DISTINCT users only if whose Activity_date(y-m-d) starts in date between 2019-11-01 to 2019-12-15AND present in every month between the date 2019-11-01 to 2020-03-30 ( to track the user activity 3 consecutive months).

The output:

(Here User1 is the only person whose Activity_date(y-m-d) falls between 2019-11-01 to 2019-12-15 AND also present in every month after that, between the date 2019-11-01 to 2020-03-30.

User2 has the starting Activity_date(y-m-d) falls between 2019-11-01 to 2019-12-15 But the Activity_date(y-m-d) not present in every month (i.e, Month-Jan & Feb) so this user not been considered in output.

+-------------+---------------+-------------------------+
| User_id     | Mail_id       | Activity_date(y-m-d)    |
+-------------+---------------+-------------------------+
| user1       | email1        | 2019-11-09 12:23:53.253 |
| user1       | email1        | 2019-12-09 12:24:53.253 |
| user1       | email1        | 2019-12-09 13:20:53.253 |
| user1       | email1        | 2020-01-09 11:23:53.253 |
| user1       | email1        | 2019-12-09 14:20:53.253 |
| user1       | email1        | 2020-02-13 10:29:53.253 |
| user1       | email1        | 2020-03-10 11:23:53.253 |
+-------------+---------------+-------------------------+

How to achieve this in SQL (Redshift)?

Upvotes: 0

Views: 78

Answers (2)

ggordon
ggordon

Reputation: 10035

Here is a fiddle using your sample data to recreate your sample results. The db fiddle to test with is using postgres but this should also work on redshift. Let me know if this works.

The approach begins by first generating all consecutive months with a recursive cte month_periods and months before checking whether a user was active in each consecutive year-month generated in users_active_in_months. The final projection selects the User_id, Mail_id and Activity_date as shared in your target dataset where Activity_date is from 2019-11-01 to 2019-12-15 and 2019-11-01 to 2020-03-30 or simply from 2019-11-01 to 2020-03-30 since this is fully inclusive.

CREATE TABLE table1 (
  User_id VARCHAR(5),
  Mail_id VARCHAR(6),
  Reg_date TIMESTAMP
);

INSERT INTO table1
  (User_id, Mail_id, Reg_date)
VALUES
  ('user1', 'email1', '2019-11-09 12:23:53.253'),
  ('user1', 'email1', '2019-11-09 12:24:53.253'),
  ('user1', 'email1', '2019-11-09 13:20:53.253'),
  ('user1', 'email1', '2019-08-09 11:23:53.253'),
  ('user2', 'email2', '2019-09-08 10:29:53.253'),
  ('user3', 'email3', '2019-09-08 14:23:53.253'),
  ('user1', 'email1', '2019-12-09 13:20:53.253'),
  ('user1', 'email1', '2019-10-10 11:23:53.253'),
  ('user1', 'email1', '2019-10-13 10:29:53.253'),
  ('user2', 'email5', '2019-11-14 10:29:53.253');

CREATE TABLE table2 (
  User_id VARCHAR(5),
  Session_id VARCHAR(3),
  Activity_date TIMESTAMP
);

INSERT INTO table2
  (User_id, Session_id, Activity_date)
VALUES
  ('user1', 's1', '2019-11-09 12:23:53.253'),
  ('user1', 's2', '2019-12-09 12:24:53.253'),
  ('user1', 's3', '2019-12-09 13:20:53.253'),
  ('user1', 's4', '2020-01-09 11:23:53.253'),
  ('user2', 's5', '2019-12-08 10:29:53.253'),
  ('user3', 's6', '2020-02-08 14:23:53.253'),
  ('user1', 's7', '2019-12-09 13:20:53.253'),
  ('user1', 's8', '2020-03-10 11:23:53.253'),
  ('user1', 's9', '2020-02-13 10:29:53.253'),
  ('user2', 's10', '2020-03-14 10:29:53.253');
  
  

Query #1

WITH recursive month_periods AS (
    SELECT '2019-11-01'::timestamp as dt UNION ALL
    SELECT (dt + interval '1 month')::timestamp as dt 
    FROM month_periods 
    WHERE dt <= '2020-03-30'
), 
months AS (
    SELECT EXTRACT(YEAR FROM dt)*100+EXTRACT(MONTH from dt) as ym from month_periods
),
users_active_in_months AS (
    SELECT 
        User_id
    FROM (
    SELECT DISTINCT 
         m.ym,
         t2.User_id
         
    FROM 
         months m
    LEFT JOIN
         table2 t2 ON
    ( EXTRACT(YEAR FROM t2.Activity_date)*100+EXTRACT(MONTH FROM t2.Activity_date))=m.ym
    WHERE t2.User_id is NOT NULL
    ) t
    GROUP BY User_id
 HAVING COUNT(User_id) = (SELECT COUNT(1) FROM months) - 1
)
SELECT DISTINCT
    t2.User_Id,
    t1.Mail_id,
    t2.Activity_date 
FROM
    table2 t2
INNER JOIN
    table1 t1 ON t2.User_id = t1.User_id
INNER JOIN
    users_active_in_months um ON um.User_id = t1.User_id
WHERE
    t2.Activity_date BETWEEN '2019-11-01' and '2020-03-30';
user_id mail_id activity_date
user1 email1 2019-11-09T12:23:53.253Z
user1 email1 2019-12-09T12:24:53.253Z
user1 email1 2019-12-09T13:20:53.253Z
user1 email1 2020-01-09T11:23:53.253Z
user1 email1 2020-02-13T10:29:53.253Z
user1 email1 2020-03-10T11:23:53.253Z

View on DB Fiddle

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269445

You can use aggregation. From your description, table1 doesn't seem needed. You can get the user_id:

select t2.user_id
from table2 t2
group by user_id
having min(activity_date) >= '2019-11-01' and
       min(activity_date) <= '2019-12-15' and
       count(distinct date_trunc('month', activity_date)) = 5;

You can then join in whatever other information you need.

Note: The above answers the specific question you asked. However, because you want activity in every month, this really requires that the first date be in November, rather than December. You can tweak the logic to handle this.

Upvotes: 1

Related Questions