Joe King
Joe King

Reputation: 3021

Calculating club membership data (new, past and current) during a period

I am working with data for club memberships.

Here the aim is, for a period between two dates, to calculate the following, in MySQL:

For example, suppose we have the following membership data:

msid  id  start                 cancelled
1     1   2020-01-01 09:00:00   null
2     2   2020-01-01 09:00:00   2020-12-31 09:00:00
3     2   2021-01-01 09:00:00   null
4     3   2020-01-01 09:00:00   2020-06-30 09:00:00
5     3   2020-02-01 09:00:00   2020-06-30 09:00:00
6     3   2020-07-01 09:00:00   null

where msid is the primary key for this table, and id is the member ID.

For the following selection of periods we should return the following:

period_start         period_end           new  cancelled  rejoined
2020-01-01 00:00:00  2020-01-01 23:59:59  3    0          0
2020-01-01 00:00:00  2020-06-30 23:59:59  3    1          0
2020-01-01 00:00:00  2020-07-01 23:59:59  3    1          1
2020-01-01 00:00:00  2020-12-31 23:59:59  3    2          1
2020-01-01 00:00:00  2021-01-01 23:59:59  3    2          2
2020-07-01 00:00:00  2021-01-01 23:59:59  0    1          2
2021-01-01 00:00:00  2021-01-01 23:59:59  0    0          1

It is possible for a member to have multiple current memberships, as in the case of ID 3, but they should only be counted once when they cancel.

Here is a db<>fiddle with the membership data in table dt and the time periods in table periods

Upvotes: 1

Views: 195

Answers (1)

buddemat
buddemat

Reputation: 5301

Since I am still learning myself, I am not sure this is the minimal solution, but it does the job.

  1. As I understand, you define a member as being new when they start their first membership (rather than when starting a second new membership, or member 3 would have counted as new twice). That means that selecting the minimum start date for each member gives you the corresponding dates.

    SELECT DISTINCT id, MIN(start) OVER (PARTITION BY id) AS start FROM dt
    
  2. Cancellation dates are pretty straightforward, they are basically the non NULL entries in the respective column.

    SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL
    
  3. From your example, I gather that re-joining dates should be start dates that occur after a previous cancellation. Therefore, we can re-use the second query above to get them. I select DISTINCT to make sure that each date is only listed once for each user (if e.g. a member cancels and re-joins multiple times)

    WITH
        tab_cancelled as (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL)
    SELECT DISTINCT dt.id, dt.start as rejoined 
               FROM dt 
         INNER JOIN tab_cancelled tc
                 ON tc.id = dt.id 
                AND dt.start > tc.cancelled  
    

To put it all together, I first used nested LEFT JOINS that check whether the respective dates lie within each period and use COUNT() OVER PARTITION to get the amounts for each period:

WITH 
  tab_start as (SELECT DISTINCT id, min(start) over (partition by id) as start FROM dt),
  tab_cancelled as (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL),
  tab_rejoined as (SELECT DISTINCT dt.id, dt.start as rejoined FROM dt INNER JOIN tab_cancelled tc ON tc.id = dt.id AND dt.start > tc.cancelled)
SELECT DISTINCT period_start,
                period_end,
                new,
                cancelled,
                COUNT(dt3.id) over (partition by period_start, period_end) as rejoined
  FROM (SELECT DISTINCT period_start, 
                        period_end, 
                        new,
                        COUNT(dt2.id) over (partition by period_start, period_end) as cancelled
          FROM (SELECT DISTINCT period_start,
                                period_end, 
                                COUNT(dt1.id) over (partition by period_start, period_end) as new
                  FROM periods
             LEFT JOIN tab_start as dt1
                    ON dt1.start between period_start and period_end) u
     LEFT JOIN tab_cancelled as dt2
            ON dt2.cancelled between period_start and period_end) v
LEFT JOIN tab_rejoined as dt3
       ON dt3.rejoined between period_start and period_end

Alternatively, the same can be achieved without window functions using COUNT(DISTINCT x) and GROUP BY:

WITH 
  tab_start as (SELECT DISTINCT id, min(start) over (partition by id) as start FROM dt),
  tab_cancelled as (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL),
  tab_rejoined as (SELECT DISTINCT dt.id, dt.start as rejoined FROM dt INNER JOIN tab_cancelled tc ON tc.id = dt.id AND dt.start > tc.cancelled)
SELECT DISTINCT period_start, 
                period_end, 
                COUNT(DISTINCT dt1.id) as new,
                COUNT(DISTINCT dt2.id, dt2.cancelled) as cancelled,
                COUNT(DISTINCT dt3.id, dt3.rejoined) as rejoined
     FROM periods
LEFT JOIN tab_start as dt1
       ON dt1.start between period_start and period_end
LEFT JOIN tab_cancelled as dt2
       ON dt2.cancelled between period_start and period_end
LEFT JOIN tab_rejoined as dt3
       ON dt3.rejoined between period_start and period_end
GROUP BY period_start, period_end

Either way, this should give you the desired result:

period_start period_end new cancelled rejoined
2020-01-01 00:00:00 2020-01-01 23:59:59 3 0 0
2020-01-01 00:00:00 2020-06-30 23:59:59 3 1 0
2020-01-01 00:00:00 2020-07-01 23:59:59 3 1 1
2020-01-01 00:00:00 2020-12-31 23:59:59 3 2 1
2020-01-01 00:00:00 2021-01-01 23:59:59 3 2 2
2020-07-01 00:00:00 2021-01-01 23:59:59 0 1 2
2021-01-01 00:00:00 2021-01-01 23:59:59 0 0 1

Note that both of this this will (by design) count multiple cancellations for a single user with multiple memberships individually if they should occur on different dates. The same is true if the same user quits and joins multiple times within a period. If you don't want both of this to happen, you can COUNT(DISTINCT id) within each period, i.e. count distinct ids rather than datetime-id-combinations.

WITH 
  tab_start as (SELECT DISTINCT id, min(start) over (partition by id) as start FROM dt),
  tab_cancelled as (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL),
  tab_rejoined as (SELECT DISTINCT dt.id, dt.start as rejoined FROM dt INNER JOIN tab_cancelled tc ON tc.id = dt.id AND dt.start > tc.cancelled)
SELECT DISTINCT period_start, 
                period_end, 
                COUNT(DISTINCT dt1.id) as new,
                COUNT(DISTINCT dt2.id) as cancelled,
                COUNT(DISTINCT dt3.id) as rejoined
     FROM periods
LEFT JOIN tab_start as dt1
       ON dt1.start between period_start and period_end
LEFT JOIN tab_cancelled as dt2
       ON dt2.cancelled between period_start and period_end
LEFT JOIN tab_rejoined as dt3
       ON dt3.rejoined between period_start and period_end
GROUP BY period_start, period_end

For your sample data, the result is the same, so you have to take a look at your edge cases and decide how you want them to be counted.

As I wrote, maybe this can be condensed somehow or improved performance-wise, but this is what I can do.

You can find an according db<>fiddle here.

Upvotes: 2

Related Questions