Mazil_tov998
Mazil_tov998

Reputation: 426

How to get cumulative counts for each ID at that point in time based on entries in another column ina a pandas dataframe

I have a dataframe as shown below:

CLIENT_ID ENCOUNTER_DATE CONDITION
8222 2020-01-01 Positive
8222 2020-03-02 Treated
8222 2020-04-18 Treated
8222 2020-07-31 Negative
8300 2017-06-10 Negative
8300 2017-09-11 Treated
8300 2018-02-01 Future Treatment
8300 2018-04-01 Treated
8300 2018-05-31 Negative
8400 2020-12-31 Future Treatment
8401 2017-08-29 Negative
8401 2017-09-15 Positive
8500 2018-10-10 Positive

Here is the code to create df:

df = pd.DataFrame({"CLIENT_ID": [8222, 8222, 8222, 8222, 8300, 8300, 8300, 8300, 8300, 8400, 8401, 8401, 8500],
                   "ENCOUNTER_DATE": ['2020-01-01', '2020-03-02', '2020-04-18', '2020-07-31', '2017-06-10', '2017-09-11', '2018-02-01', '2018-04-01', '2018-05-31', '2020-12-31', '2017-08-29', '2017-09-15', '2018-10-10'],
                   "CONDITION": ["positive", "treated", "treated", "negative", "negative", "treated", "future treatment", "treated", "negative", "future treatment", "negative", "positive", "positive"]})

manage_condition_list = ['positive','treated','future treatment']

The table is sorted by the CLIENT_ID and DATE_ENCOUNTER.

I want to get the cumulative count (number of times) that client, CLIENT_ID had a CONDITION in the list manage_condition_list at that point in time. So that final dataframe or output will look like below:

CLIENT_ID ENCOUNTER_DATE CONDITION CONDITION_COUNTS
8222 2020-01-01 Positive 1
8222 2020-03-02 Treated 2
8222 2020-04-18 Treated 3
8222 2020-07-31 Negative 3
8300 2017-06-10 Negative 0
8300 2017-09-11 Treated 1
8300 2018-02-01 Future Treatment 2
8300 2018-04-01 Treated 3
8300 2018-05-31 Negative 3
8400 2020-12-31 Future Treatment 1
8401 2017-08-29 Negative 0
8401 2017-09-15 Positive 1
8500 2018-10-10 Positive 1

Note the real data has significantly more entries NOT in the manage_condition_list. I am thinking a combination of df.where and cumcount() + 1 but am not too sure.

Upvotes: 3

Views: 1451

Answers (2)

doppelherz7
doppelherz7

Reputation: 11

Not sure whether I understood the logic behind cum_counts, but hope this helps

df['Cum_Count']= df.groupby('CLIENT_ID').cumcount('Condition')
df

or

df['Cum_Count']= df.groupby('CLIENT_ID')['CONDITION'].cumcount()

or

df['CONDITION_COUNTS'] = (df['CONDITION'].isin(manage_condition_list).groupby(df['CLIENT_ID']).cumcount())

Upvotes: 0

Ben.T
Ben.T

Reputation: 29635

use isin to get True if the value is in the list manage_condition_list on the column CONDITION, then groupby.cumsum by the CLIENT_ID column

df['CONDITION_COUNTS'] = (
    df['CONDITION'].isin(manage_condition_list)
      .groupby(df['CLIENT_ID']).cumsum()
)
print(df)
    CLIENT_ID ENCOUNTER_DATE         CONDITION  CONDITION_COUNTS
0        8222     2020-01-01          positive                 1
1        8222     2020-03-02           treated                 2
2        8222     2020-04-18           treated                 3
3        8222     2020-07-31          negative                 3
4        8300     2017-06-10          negative                 0
5        8300     2017-09-11           treated                 1
6        8300     2018-02-01  future treatment                 2
7        8300     2018-04-01           treated                 3
8        8300     2018-05-31          negative                 3
9        8400     2020-12-31  future treatment                 1
10       8401     2017-08-29          negative                 0
11       8401     2017-09-15          positive                 1
12       8500     2018-10-10          positive                 1

Upvotes: 3

Related Questions