Calculating club membership data (new, past and current) during a period

Question

I am working with data for club memberships.

Here the aim is, for a period between two dates, to calculate the following, in MySQL:

the number of new members during the period.
the number of members who cancelled during the period
the number of members who re-joined during the period

For example, suppose we have the following membership data:

msid  id  start                 cancelled
1     1   2020-01-01 09:00:00   null
2     2   2020-01-01 09:00:00   2020-12-31 09:00:00
3     2   2021-01-01 09:00:00   null
4     3   2020-01-01 09:00:00   2020-06-30 09:00:00
5     3   2020-02-01 09:00:00   2020-06-30 09:00:00
6     3   2020-07-01 09:00:00   null

where msid is the primary key for this table, and id is the member ID.

For the following selection of periods we should return the following:

period_start         period_end           new  cancelled  rejoined
2020-01-01 00:00:00  2020-01-01 23:59:59  3    0          0
2020-01-01 00:00:00  2020-06-30 23:59:59  3    1          0
2020-01-01 00:00:00  2020-07-01 23:59:59  3    1          1
2020-01-01 00:00:00  2020-12-31 23:59:59  3    2          1
2020-01-01 00:00:00  2021-01-01 23:59:59  3    2          2
2020-07-01 00:00:00  2021-01-01 23:59:59  0    1          2
2021-01-01 00:00:00  2021-01-01 23:59:59  0    0          1

It is possible for a member to have multiple current memberships, as in the case of ID 3, but they should only be counted once when they cancel.

Here is a db<>fiddle with the membership data in table dt and the time periods in table periods

buddemat · Accepted Answer

Since I am still learning myself, I am not sure this is the minimal solution, but it does the job.

As I understand, you define a member as being new when they start their first membership (rather than when starting a second new membership, or member 3 would have counted as new twice). That means that selecting the minimum start date for each member gives you the corresponding dates.
```
SELECT DISTINCT id, MIN(start) OVER (PARTITION BY id) AS start FROM dt
```
Cancellation dates are pretty straightforward, they are basically the non NULL entries in the respective column.
```
SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL
```
From your example, I gather that re-joining dates should be start dates that occur after a previous cancellation. Therefore, we can re-use the second query above to get them. I select DISTINCT to make sure that each date is only listed once for each user (if e.g. a member cancels and re-joins multiple times)
```
WITH
    tab_cancelled as (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL)
SELECT DISTINCT dt.id, dt.start as rejoined 
           FROM dt 
     INNER JOIN tab_cancelled tc
             ON tc.id = dt.id 
            AND dt.start > tc.cancelled  
```

To put it all together, I first used nested LEFT JOINS that check whether the respective dates lie within each period and use COUNT() OVER PARTITION to get the amounts for each period:

WITH 
  tab_start as (SELECT DISTINCT id, min(start) over (partition by id) as start FROM dt),
  tab_cancelled as (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL),
  tab_rejoined as (SELECT DISTINCT dt.id, dt.start as rejoined FROM dt INNER JOIN tab_cancelled tc ON tc.id = dt.id AND dt.start > tc.cancelled)
SELECT DISTINCT period_start,
                period_end,
                new,
                cancelled,
                COUNT(dt3.id) over (partition by period_start, period_end) as rejoined
  FROM (SELECT DISTINCT period_start, 
                        period_end, 
                        new,
                        COUNT(dt2.id) over (partition by period_start, period_end) as cancelled
          FROM (SELECT DISTINCT period_start,
                                period_end, 
                                COUNT(dt1.id) over (partition by period_start, period_end) as new
                  FROM periods
             LEFT JOIN tab_start as dt1
                    ON dt1.start between period_start and period_end) u
     LEFT JOIN tab_cancelled as dt2
            ON dt2.cancelled between period_start and period_end) v
LEFT JOIN tab_rejoined as dt3
       ON dt3.rejoined between period_start and period_end

Alternatively, the same can be achieved without window functions using COUNT(DISTINCT x) and GROUP BY:

WITH 
  tab_start as (SELECT DISTINCT id, min(start) over (partition by id) as start FROM dt),
  tab_cancelled as (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL),
  tab_rejoined as (SELECT DISTINCT dt.id, dt.start as rejoined FROM dt INNER JOIN tab_cancelled tc ON tc.id = dt.id AND dt.start > tc.cancelled)
SELECT DISTINCT period_start, 
                period_end, 
                COUNT(DISTINCT dt1.id) as new,
                COUNT(DISTINCT dt2.id, dt2.cancelled) as cancelled,
                COUNT(DISTINCT dt3.id, dt3.rejoined) as rejoined
     FROM periods
LEFT JOIN tab_start as dt1
       ON dt1.start between period_start and period_end
LEFT JOIN tab_cancelled as dt2
       ON dt2.cancelled between period_start and period_end
LEFT JOIN tab_rejoined as dt3
       ON dt3.rejoined between period_start and period_end
GROUP BY period_start, period_end

Either way, this should give you the desired result:

period_start	period_end	new	cancelled	rejoined
2020-01-01 00:00:00	2020-01-01 23:59:59	3	0	0
2020-01-01 00:00:00	2020-06-30 23:59:59	3	1	0
2020-01-01 00:00:00	2020-07-01 23:59:59	3	1	1
2020-01-01 00:00:00	2020-12-31 23:59:59	3	2	1
2020-01-01 00:00:00	2021-01-01 23:59:59	3	2	2
2020-07-01 00:00:00	2021-01-01 23:59:59	0	1	2
2021-01-01 00:00:00	2021-01-01 23:59:59	0	0	1

Note that both of this this will (by design) count multiple cancellations for a single user with multiple memberships individually if they should occur on different dates. The same is true if the same user quits and joins multiple times within a period. If you don't want both of this to happen, you can COUNT(DISTINCT id) within each period, i.e. count distinct ids rather than datetime-id-combinations.

WITH 
  tab_start as (SELECT DISTINCT id, min(start) over (partition by id) as start FROM dt),
  tab_cancelled as (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL),
  tab_rejoined as (SELECT DISTINCT dt.id, dt.start as rejoined FROM dt INNER JOIN tab_cancelled tc ON tc.id = dt.id AND dt.start > tc.cancelled)
SELECT DISTINCT period_start, 
                period_end, 
                COUNT(DISTINCT dt1.id) as new,
                COUNT(DISTINCT dt2.id) as cancelled,
                COUNT(DISTINCT dt3.id) as rejoined
     FROM periods
LEFT JOIN tab_start as dt1
       ON dt1.start between period_start and period_end
LEFT JOIN tab_cancelled as dt2
       ON dt2.cancelled between period_start and period_end
LEFT JOIN tab_rejoined as dt3
       ON dt3.rejoined between period_start and period_end
GROUP BY period_start, period_end

For your sample data, the result is the same, so you have to take a look at your edge cases and decide how you want them to be counted.

As I wrote, maybe this can be condensed somehow or improved performance-wise, but this is what I can do.

You can find an according db<>fiddle here.

Calculating club membership data (new, past and current) during a period

Answers (1)

Related Questions