Macterror
Macterror

Reputation: 431

Counting subscriber numbers given events on SQL

I have a dataset on mysql in the following format, showing the history of events given some client IDs: Base Data

Text of the dataset (subscriber_table):

user_id   type  created_at
A   past_due    2021-03-27 10:15:56
A   reactivate  2021-02-06 10:21:35
A   past_due    2021-01-27 10:30:41
A   new 2020-10-28 18:53:07
A   cancel  2020-07-22 9:48:54
A   reactivate  2020-07-22 9:48:53
A   cancel  2020-07-15 2:53:05
A   new 2020-06-20 20:24:18
B   reactivate  2020-06-14 10:57:50
B   past_due    2020-06-14 10:33:21
B   new 2020-06-11 10:21:24

date_table:

full_date
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2020-12-01
2021-01-01
2021-02-01
2021-03-01

I have been struggling to come up with a query to count subscriber counts given a range of months, which are not necessary included in the event table either because the client is still subscribed or they cancelled and later resubscribed. The output I am looking for is this:

Output

date    subscriber_count
2020-05-01  0
2020-06-01  2
2020-07-01  2
2020-08-01  1
2020-09-01  1
2020-10-01  2
2020-11-01  2
2020-12-01  2
2021-01-01  2
2021-02-01  2
2021-03-01  2

Reactivation and Past Due events do not change the subscription status of the client, however only the Cancel and New event do. If the client cancels in a month, they should still be counted as active for that month.

My initial approach was to get the latest entry given a month per subscriber ID and then join them to the premade date table, but when I have months missing I am unsure on how to fill them with the correct status. Maybe a lag function?

with last_record_per_month as (
    select
date_trunc('month', created_at)::date order by created_at) as month_year ,
        user_id ,
        type,
        created_at as created_at
    from
        subscriber_table
    where
        user_id in ('A', 'B')
    order by
        created_at desc
    ), final as (
    select
        month_year,
        created_at,
        type
    from
        last_record_per_month lrpm
    right join (
        select
            date_trunc('month', full_date)::date as month_year
        from
            date_table
        where
            full_date between '2020-05-01' and '2021-03-31'
        group by
            1
        order by
            1
        ) dd
    on lrpm.created_at = dd.month_year
    and num = 1
    order by
        month_year
    )
select
    *
from
    final

I do have a premade base table with every single date in many years to use as a joining table

Any help with this is GREATLY appreciated

Thanks!

Upvotes: 1

Views: 139

Answers (1)

Praveenrajan27
Praveenrajan27

Reputation: 641

The approach here is to have the subscriber rows with new connections as base and map them to the cancelled rows using a self join. Then have the date tables as base and aggregate them based on the number of users to get the result.

SELECT full_date, COUNT(DISTINCT user_id) FROM date_tbl
LEFT JOIN(
SELECT new.user_id,new.type,new.created_at created_at_new,
IFNULL(cancel.created_at,CURRENT_DATE) created_at_cancel
FROM subscriber new 
LEFT JOIN subscriber cancel 
ON new.user_id=cancel.user_id
AND new.type='new' AND cancel.type='cancel'
AND new.created_at<= cancel.created_at
WHERE new.type IN('new'))s
ON DATE_FORMAT(s.created_at_new, '%Y-%m')<=DATE_FORMAT(full_date, '%Y-%m')
AND DATE_FORMAT(s.created_at_cancel, '%Y-%m')>=DATE_FORMAT(full_date, '%Y-%m')
GROUP BY 1

Let me breakdown some sections

  1. First up we need to have the subscriber table self joined based on user_id and then left table with rows as 'new' and the right one with 'cancel' new.type='new' AND cancel.type='cancel'

  2. The new ones should always precede the canceled rows so adding this new.created_at<= cancel.created_at

  3. Since we only care about the rows with new in the base table we filter out the rows in the WHERE clause new.type IN('new'). The result of the subquery would look something like this enter image description here

  4. We can then join this subquery with a Left join the date table such that the year and month of the created_at_new column is always less than equal to the full_date DATE_FORMAT(s.created_at_new, '%Y-%m')<=DATE_FORMAT(full_date, '%Y-%m') but greater than that of the canceled date.

  5. Lastly we aggregate based on the full_date and consider the unique count of users

enter image description here

fiddle

Upvotes: 1

Related Questions