How to get the count of rows prior a specific date with some conditions

Question

I have a dataset like the following:

    start_date  finish_date
0   2020-06-01  2020-06-02
1   2020-06-02  2020-06-04
2   2020-06-03  NaT
3   2020-06-04  2020-06-07
4   2020-06-05  2020-06-07
5   2020-06-06  NaT
6   2020-06-07  NaT

which can be re-created using the following code:

import pandas as pd

df = pd.DataFrame({
    'start_date': ['2020-06-01', '2020-06-02', '2020-06-03', '2020-06-04', '2020-06-05', '2020-06-06', '2020-06-07'],
    'finish_date': ['2020-06-02', '2020-06-04', pd.NA, '2020-06-07', '2020-06-07', pd.NA, pd.NA],
})
df['start_date'] = df['start_date'].apply(pd.to_datetime)
df['finish_date'] = df['finish_date'].apply(pd.to_datetime)

The question is: how to get the count of rows which have no finish_date or not yet finished by the reporting_date, the following is the expected result:

    reporting_date  not_finished
0   2020-06-01      1
1   2020-06-02      1
2   2020-06-03      2
3   2020-06-04      2
4   2020-06-05      3
5   2020-06-06      4
6   2020-06-07      3

To explain the expected result above:

(when I say row, I refer to the row in the dataset not the result)

By reporting_date 2020-06-01, the row 0 has started but not finished, that counts 1.
By reporting_date 2020-06-02, the row 0 has already finished and row 1 has started but not finished yet, that counts 1.
By reporting_date 2020-06-03, the rows 1 and 2 have started but not finished yet, that counts 2.
By reporting_date 2020-06-04, the row 2 has already finished and rows 2 and 3 have started but not finished yet, that counts 2.
and so on..

Andy L. · Accepted Answer

Use list comprehesion to compare each reporting_date to previous values of finish_date. To handle NaT, you need to fillna it by pd.Timestamp.max

s = df.finish_date.fillna(pd.Timestamp.max)
df['not_finished'] = [(x < s.loc[:i]).sum()  for i, x in enumerate(df.start_date)]

Out[35]:
  start_date finish_date  not_finished
0 2020-06-01  2020-06-02             1
1 2020-06-02  2020-06-04             1
2 2020-06-03         NaT             2
3 2020-06-04  2020-06-07             2
4 2020-06-05  2020-06-07             3
5 2020-06-06         NaT             4
6 2020-06-07         NaT             3

How to get the count of rows prior a specific date with some conditions

Answers (2)

Related Questions