Reputation: 3021
I am working with data for club memberships.
Here the aim is, for a period between two dates, to calculate the following, in MySQL:
For example, suppose we have the following membership data:
msid id start cancelled
1 1 2020-01-01 09:00:00 null
2 2 2020-01-01 09:00:00 2020-12-31 09:00:00
3 2 2021-01-01 09:00:00 null
4 3 2020-01-01 09:00:00 2020-06-30 09:00:00
5 3 2020-02-01 09:00:00 2020-06-30 09:00:00
6 3 2020-07-01 09:00:00 null
where msid
is the primary key for this table, and id
is the member ID.
For the following selection of periods we should return the following:
period_start period_end new cancelled rejoined
2020-01-01 00:00:00 2020-01-01 23:59:59 3 0 0
2020-01-01 00:00:00 2020-06-30 23:59:59 3 1 0
2020-01-01 00:00:00 2020-07-01 23:59:59 3 1 1
2020-01-01 00:00:00 2020-12-31 23:59:59 3 2 1
2020-01-01 00:00:00 2021-01-01 23:59:59 3 2 2
2020-07-01 00:00:00 2021-01-01 23:59:59 0 1 2
2021-01-01 00:00:00 2021-01-01 23:59:59 0 0 1
It is possible for a member to have multiple current memberships, as in the case of ID 3, but they should only be counted once when they cancel.
Here is a db<>fiddle with the membership data in table dt
and the time periods in table periods
Upvotes: 1
Views: 195
Reputation: 5301
Since I am still learning myself, I am not sure this is the minimal solution, but it does the job.
As I understand, you define a member as being new when they start their first membership (rather than when starting a second new membership, or member 3
would have counted as new twice). That means that selecting the minimum start date for each member gives you the corresponding dates.
SELECT DISTINCT id, MIN(start) OVER (PARTITION BY id) AS start FROM dt
Cancellation dates are pretty straightforward, they are basically the non NULL
entries in the respective column.
SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL
From your example, I gather that re-joining dates should be start dates that occur after a previous cancellation. Therefore, we can re-use the second query above to get them. I select DISTINCT
to make sure that each date is only listed once for each user (if e.g. a member cancels and re-joins multiple times)
WITH
tab_cancelled as (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL)
SELECT DISTINCT dt.id, dt.start as rejoined
FROM dt
INNER JOIN tab_cancelled tc
ON tc.id = dt.id
AND dt.start > tc.cancelled
To put it all together, I first used nested LEFT JOINS
that check whether the respective dates lie within each period and use COUNT() OVER PARTITION
to get the amounts for each period:
WITH
tab_start as (SELECT DISTINCT id, min(start) over (partition by id) as start FROM dt),
tab_cancelled as (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL),
tab_rejoined as (SELECT DISTINCT dt.id, dt.start as rejoined FROM dt INNER JOIN tab_cancelled tc ON tc.id = dt.id AND dt.start > tc.cancelled)
SELECT DISTINCT period_start,
period_end,
new,
cancelled,
COUNT(dt3.id) over (partition by period_start, period_end) as rejoined
FROM (SELECT DISTINCT period_start,
period_end,
new,
COUNT(dt2.id) over (partition by period_start, period_end) as cancelled
FROM (SELECT DISTINCT period_start,
period_end,
COUNT(dt1.id) over (partition by period_start, period_end) as new
FROM periods
LEFT JOIN tab_start as dt1
ON dt1.start between period_start and period_end) u
LEFT JOIN tab_cancelled as dt2
ON dt2.cancelled between period_start and period_end) v
LEFT JOIN tab_rejoined as dt3
ON dt3.rejoined between period_start and period_end
Alternatively, the same can be achieved without window functions using COUNT(DISTINCT x)
and GROUP BY
:
WITH
tab_start as (SELECT DISTINCT id, min(start) over (partition by id) as start FROM dt),
tab_cancelled as (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL),
tab_rejoined as (SELECT DISTINCT dt.id, dt.start as rejoined FROM dt INNER JOIN tab_cancelled tc ON tc.id = dt.id AND dt.start > tc.cancelled)
SELECT DISTINCT period_start,
period_end,
COUNT(DISTINCT dt1.id) as new,
COUNT(DISTINCT dt2.id, dt2.cancelled) as cancelled,
COUNT(DISTINCT dt3.id, dt3.rejoined) as rejoined
FROM periods
LEFT JOIN tab_start as dt1
ON dt1.start between period_start and period_end
LEFT JOIN tab_cancelled as dt2
ON dt2.cancelled between period_start and period_end
LEFT JOIN tab_rejoined as dt3
ON dt3.rejoined between period_start and period_end
GROUP BY period_start, period_end
Either way, this should give you the desired result:
period_start | period_end | new | cancelled | rejoined |
---|---|---|---|---|
2020-01-01 00:00:00 | 2020-01-01 23:59:59 | 3 | 0 | 0 |
2020-01-01 00:00:00 | 2020-06-30 23:59:59 | 3 | 1 | 0 |
2020-01-01 00:00:00 | 2020-07-01 23:59:59 | 3 | 1 | 1 |
2020-01-01 00:00:00 | 2020-12-31 23:59:59 | 3 | 2 | 1 |
2020-01-01 00:00:00 | 2021-01-01 23:59:59 | 3 | 2 | 2 |
2020-07-01 00:00:00 | 2021-01-01 23:59:59 | 0 | 1 | 2 |
2021-01-01 00:00:00 | 2021-01-01 23:59:59 | 0 | 0 | 1 |
Note that both of this this will (by design) count multiple cancellations for a single user with multiple memberships individually if they should occur on different dates. The same is true if the same user quits and joins multiple times within a period. If you don't want both of this to happen, you can COUNT(DISTINCT id)
within each period, i.e. count distinct ids rather than datetime-id-combinations.
WITH
tab_start as (SELECT DISTINCT id, min(start) over (partition by id) as start FROM dt),
tab_cancelled as (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL),
tab_rejoined as (SELECT DISTINCT dt.id, dt.start as rejoined FROM dt INNER JOIN tab_cancelled tc ON tc.id = dt.id AND dt.start > tc.cancelled)
SELECT DISTINCT period_start,
period_end,
COUNT(DISTINCT dt1.id) as new,
COUNT(DISTINCT dt2.id) as cancelled,
COUNT(DISTINCT dt3.id) as rejoined
FROM periods
LEFT JOIN tab_start as dt1
ON dt1.start between period_start and period_end
LEFT JOIN tab_cancelled as dt2
ON dt2.cancelled between period_start and period_end
LEFT JOIN tab_rejoined as dt3
ON dt3.rejoined between period_start and period_end
GROUP BY period_start, period_end
For your sample data, the result is the same, so you have to take a look at your edge cases and decide how you want them to be counted.
As I wrote, maybe this can be condensed somehow or improved performance-wise, but this is what I can do.
You can find an according db<>fiddle here.
Upvotes: 2