DanZimmerman
DanZimmerman

Reputation: 1716

SQL - use start and end date to calculate number of students enrolled

I have a table with three columns:

Each student's enrollment period is provided by start_yearmonthdate and end_yearmonthdate. Where the values are in the yymmdd format. Also, if a student has not specified an end date yet, the end_yearmonthdate will show '000000'. You can consider the timeframe is in the year of 2022.

I want to know for each month, how many students are enrolled. For example, for student 100001, Jan, Feb, and Mar are all the enrollment months for this student.

Here is the table:

student_id start_yearmonthdate end_yearmonthdate
100001 220101 220331
100002 220201 220731
100003 220101 221231
100004 220701 221031
100005 220601 220930
100006 220101 221231
100006 220201 221231
100007 220701 000000
100008 220601 000000

You can use sqlfiddle.com to build the table:

create table training (
  student_id int(6),
  start_yearmonthdate varchar(30),
  end_yearmonthdate varchar(30)
);

insert into training(student_id, start_yearmonthdate, end_yearmonthdate)
values(100001, '220101', '220331');
insert into training(student_id, start_yearmonthdate, end_yearmonthdate)
values(100002, '220201', '220731');
insert into training(student_id, start_yearmonthdate, end_yearmonthdate)
values(100003, '220101', '221231');
insert into training(student_id, start_yearmonthdate, end_yearmonthdate)
values(100004, '220701', '221031');
insert into training(student_id, start_yearmonthdate, end_yearmonthdate)
values(100005, '220601', '220930');
insert into training(student_id, start_yearmonthdate, end_yearmonthdate)
values(100006, '220101', '221231');
insert into training(student_id, start_yearmonthdate, end_yearmonthdate)
values(100006, '220201', '221231');
insert into training(student_id, start_yearmonthdate, end_yearmonthdate)
values(100007, '220701', '000000');
insert into training(student_id, start_yearmonthdate, end_yearmonthdate)
values(100008, '220601', '000000');

Update: I am trying to get this work in postgresql as well. But I am new to postgresql and the syntax is taking too much time to change/validate. Can someone provide answer for postgresql as well, thanks.

Upvotes: 0

Views: 412

Answers (3)

nbk
nbk

Reputation: 49395

MySQL has no time series, so you must generate one to get to the data, which is needed to group the data.

Still every STR_TO_DATE will cost lots of performace.

WITH RECURSIVE t_series as (
    select '2022-01-01' as dt
  UNION
    SELECT DATE_ADD(t_series.dt, INTERVAL 1 DAY) FROM t_series WHERE DATE_ADD(t_series.dt, INTERVAL 1 DAY) <= '2022-12-31'
)
select 

COUNT(DISTINCT t2.student_id)
,EXTRACT(YEAR_MONTH FROM t1.dt)
FROM t_series t1 JOIN training t2 
ON t1.dt BETWEEN STR_TO_DATE(t2.start_yearmonthdate,'%y%m%d') 
AND STR_TO_DATE(IF(t2.end_yearmonthdate = '000000','221231',t2.end_yearmonthdate),'%y%m%d')
GROUP BY EXTRACT(YEAR_MONTH FROM t1.dt);
COUNT(DISTINCT t2.student_id) | EXTRACT(YEAR_MONTH FROM t1.dt)
----------------------------: | -----------------------------:
                            3 |                         202201
                            4 |                         202202
                            4 |                         202203
                            3 |                         202204
                            3 |                         202205
                            5 |                         202206
                            7 |                         202207
                            6 |                         202208
                            6 |                         202209
                            5 |                         202210
                            4 |                         202211
                            4 |                         202212

db<>fiddle here

For posgres it will look similar

WITH t_series as (
SELECT t.day::date 
FROM   generate_series(timestamp '2022-01-01'
                     , timestamp '2022-12-31'
                     , interval  '1 day') AS t(day))
select 

COUNT(DISTINCT t2.student_id)
,to_char( t1.day, 'YYYY-MM')
FROM t_series t1 JOIN training t2 
ON t1.day BETWEEN TO_DATE(t2.start_yearmonthdate,'YYMMDD') 
AND TO_DATE(CASE WHEN t2.end_yearmonthdate = '000000' THEN '221231' ELSe t2.end_yearmonthdate END,'YYMMDD')
GROUP BY to_char( t1.day, 'YYYY-MM');
count | to_char
----: | :------
    3 | 2022-01
    4 | 2022-02
    4 | 2022-03
    3 | 2022-04
    3 | 2022-05
    5 | 2022-06
    7 | 2022-07
    6 | 2022-08
    6 | 2022-09
    5 | 2022-10
    4 | 2022-11
    4 | 2022-12

db<>fiddle here

Upvotes: 2

Thorsten Kettner
Thorsten Kettner

Reputation: 95072

One of the main problems is to get a proper table with real dates, for that is easier to work with and less prone to errors.

Once this is done, get all months from the first date to the last date in the table with a recursive query.

Then select the months and count the students who came and went.

with recursive
proper_table as
(
  select
    student_id,
    str_to_date(start_yearmonthdate, '%y%m%d') as start_date,
    case when end_yearmonthdate = '000000' 
      then null
      else str_to_date(end_yearmonthdate, '%y%m%d')
    end as end_date
  from training
),
months (last_date) as
(
  select min(last_day(start_date))
  from proper_table
  union all
  select last_date + interval 1 month
  from months
  where last_date + interval 1 month < (select max(start_date) from proper_table)
     or last_date + interval 1 month < (select max(end_date) from proper_table)
)
select 
  date_format(last_date, '%Y-%m'),
  (select count(*) from proper_table where start_date <= months.last_date) -
  (select count(*) from proper_table where end_date <= months.last_date)
    as number_of_students
from months;

Demo: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=b47cecc64b18f48385e94505f5b481bc

Upvotes: 1

Shuki Molk
Shuki Molk

Reputation: 51

There are lots of steps involved here, including creating a "months calendar" table using recursive CTE.

I tried my best to make each step clear by using expressive aliases and CTE names; However, you might need to go over each step thoroughly.

WITH Step_0_create_months_calendar AS
    (
    SELECT CAST('2022-01-01' AS date) AS month_1st_day
    UNION ALL
    SELECT DATEADD(MONTH, 1, month_1st_day)
    FROM Step_0_create_months_calendar
    WHERE month_1st_day < '2024-01-01'
    )
, Step_1_real_dates AS
    (
    SELECT student_id
    , CAST('20' + start_yearmonthdate AS date) AS start_yearmonthdate
    , CAST(CASE
            WHEN end_yearmonthdate = '000000' THEN GETDATE()
            ELSE '20' + end_yearmonthdate
        END AS date) AS end_yearmonthdate
    FROM training
    )
, step_2_explode_months_per_student AS
    (
    SELECT *
    FROM Step_1_real_dates AS S1
    INNER JOIN Step_0_create_months_calendar AS clndr
    ON clndr.month_1st_day BETWEEN S1.start_yearmonthdate AND S1.end_yearmonthdate
    )
, Step_3_just_student_year_and_month AS
    (
    SELECT student_id, YEAR(month_1st_day) as year_num ,MONTH(month_1st_day) AS month_num
    FROM step_2_explode_months_per_student
    )
-- Last step: Aggregating
SELECT year_num, month_num, COUNT(DISTINCT student_id) AS num_of_students
FROM Step_3_just_student_year_and_month
GROUP BY year_num, month_num
ORDER BY year_num, month_num

Upvotes: 2

Related Questions