Reputation: 1918
I have a dataset like the following:
start_date finish_date
0 2020-06-01 2020-06-02
1 2020-06-02 2020-06-04
2 2020-06-03 NaT
3 2020-06-04 2020-06-07
4 2020-06-05 2020-06-07
5 2020-06-06 NaT
6 2020-06-07 NaT
which can be re-created using the following code:
import pandas as pd
df = pd.DataFrame({
'start_date': ['2020-06-01', '2020-06-02', '2020-06-03', '2020-06-04', '2020-06-05', '2020-06-06', '2020-06-07'],
'finish_date': ['2020-06-02', '2020-06-04', pd.NA, '2020-06-07', '2020-06-07', pd.NA, pd.NA],
})
df['start_date'] = df['start_date'].apply(pd.to_datetime)
df['finish_date'] = df['finish_date'].apply(pd.to_datetime)
The question is: how to get the count of rows which have no finish_date
or not yet finished by the reporting_date
, the following is the expected result:
reporting_date not_finished
0 2020-06-01 1
1 2020-06-02 1
2 2020-06-03 2
3 2020-06-04 2
4 2020-06-05 3
5 2020-06-06 4
6 2020-06-07 3
To explain the expected result above:
(when I say row, I refer to the row in the dataset not the result)
reporting_date
2020-06-01
, the row 0
has started but not finished, that counts 1
.reporting_date
2020-06-02
, the row 0
has already finished and row 1
has started but not finished yet, that counts 1
.reporting_date
2020-06-03
, the rows 1
and 2
have started but not finished yet, that counts 2
.reporting_date
2020-06-04
, the row 2
has already finished and rows 2
and 3
have started but not finished yet, that counts 2
.Upvotes: 3
Views: 87
Reputation: 25239
Use list comprehesion to compare each reporting_date
to previous values of finish_date
. To handle NaT
, you need to fillna
it by pd.Timestamp.max
s = df.finish_date.fillna(pd.Timestamp.max)
df['not_finished'] = [(x < s.loc[:i]).sum() for i, x in enumerate(df.start_date)]
Out[35]:
start_date finish_date not_finished
0 2020-06-01 2020-06-02 1
1 2020-06-02 2020-06-04 1
2 2020-06-03 NaT 2
3 2020-06-04 2020-06-07 2
4 2020-06-05 2020-06-07 3
5 2020-06-06 NaT 4
6 2020-06-07 NaT 3
Upvotes: 1
Reputation: 195408
def fn():
d, open_dates = (yield), []
while True:
open_dates = [od for od in open_dates + [d.finish_date] if od is pd.NaT or od > d.start_date]
d = yield len(open_dates)
f = fn();next(f)
df['not_finished'] = df.apply(lambda x: f.send(x), axis=1)
print(df)
Prints:
start_date finish_date not_finished
0 2020-06-01 2020-06-02 1
1 2020-06-02 2020-06-04 1
2 2020-06-03 NaT 2
3 2020-06-04 2020-06-07 2
4 2020-06-05 2020-06-07 3
5 2020-06-06 NaT 4
6 2020-06-07 NaT 3
Upvotes: 1