mohd4482
mohd4482

Reputation: 1918

How to get the count of rows prior a specific date with some conditions

I have a dataset like the following:

    start_date  finish_date
0   2020-06-01  2020-06-02
1   2020-06-02  2020-06-04
2   2020-06-03  NaT
3   2020-06-04  2020-06-07
4   2020-06-05  2020-06-07
5   2020-06-06  NaT
6   2020-06-07  NaT

which can be re-created using the following code:

import pandas as pd

df = pd.DataFrame({
    'start_date': ['2020-06-01', '2020-06-02', '2020-06-03', '2020-06-04', '2020-06-05', '2020-06-06', '2020-06-07'],
    'finish_date': ['2020-06-02', '2020-06-04', pd.NA, '2020-06-07', '2020-06-07', pd.NA, pd.NA],
})
df['start_date'] = df['start_date'].apply(pd.to_datetime)
df['finish_date'] = df['finish_date'].apply(pd.to_datetime)

The question is: how to get the count of rows which have no finish_date or not yet finished by the reporting_date, the following is the expected result:

    reporting_date  not_finished
0   2020-06-01      1
1   2020-06-02      1
2   2020-06-03      2
3   2020-06-04      2
4   2020-06-05      3
5   2020-06-06      4
6   2020-06-07      3

To explain the expected result above:

(when I say row, I refer to the row in the dataset not the result)

Upvotes: 3

Views: 87

Answers (2)

Andy L.
Andy L.

Reputation: 25239

Use list comprehesion to compare each reporting_date to previous values of finish_date. To handle NaT, you need to fillna it by pd.Timestamp.max

s = df.finish_date.fillna(pd.Timestamp.max)
df['not_finished'] = [(x < s.loc[:i]).sum()  for i, x in enumerate(df.start_date)]

Out[35]:
  start_date finish_date  not_finished
0 2020-06-01  2020-06-02             1
1 2020-06-02  2020-06-04             1
2 2020-06-03         NaT             2
3 2020-06-04  2020-06-07             2
4 2020-06-05  2020-06-07             3
5 2020-06-06         NaT             4
6 2020-06-07         NaT             3

Upvotes: 1

Andrej Kesely
Andrej Kesely

Reputation: 195408

def fn():
    d, open_dates = (yield), []
    while True:
        open_dates = [od for od in open_dates + [d.finish_date] if od is pd.NaT or od > d.start_date]
        d = yield len(open_dates)

f = fn();next(f)
df['not_finished'] = df.apply(lambda x: f.send(x), axis=1)
print(df)

Prints:

  start_date finish_date  not_finished
0 2020-06-01  2020-06-02             1
1 2020-06-02  2020-06-04             1
2 2020-06-03         NaT             2
3 2020-06-04  2020-06-07             2
4 2020-06-05  2020-06-07             3
5 2020-06-06         NaT             4
6 2020-06-07         NaT             3

Upvotes: 1

Related Questions