Reputation: 1657
I am working on a dataset in which I need to calculate the number of other events that occurred before a particular event. The sample data is below.
In the above image there are two PUMP RUN events. What I need to derive is the cumulative number of other events before each PUMP RUN, stored in a new column, for each ID value. There will be multiple ID values, each with multiple events alongside the PUMP RUN event. For example, based on the above image, for the first occurrence of PUMP RUN within that ID value the count has to be 5, for the second occurrence it has to be 3, and so on. Once the ID value changes, the cumulative count needs to restart.
Desired Output
I tried to do this with the code below; however, it is not getting grouped or partitioned by the ID value.
data['Count'] = data.groupby(data['EventDescription'].str.startswith('Pump RUN').cumsum()).cumcount().add(1)
data['EventsCount'] = data.groupby('ID')['Count'].apply(lambda x: x.shift(1))
Secondly, I need to find how many PUMP RUN events there were for each ID value. I was able to work this out with the code below.
data['PumpRun'] = np.where(data['EventDescription'].str.startswith('Pump RUN'), 1, 0)
data['RunCount'] = data.groupby(['ID']).PumpRun.transform('sum')
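On a small toy frame (hypothetical data, same column names), the two steps above behave like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy events across two IDs.
data = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2],
    'EventDescription': ['Pump RUN', 'rate update', 'Pump RUN',
                         'Pump RUN', 'alarm'],
})

# Flag Pump RUN rows, then broadcast the per-ID total back to every row.
data['PumpRun'] = np.where(data['EventDescription'].str.startswith('Pump RUN'), 1, 0)
data['RunCount'] = data.groupby(['ID']).PumpRun.transform('sum')
print(data['RunCount'].tolist())  # [2, 2, 2, 1, 1]
```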
However, I want to know whether there is a simpler way to achieve this without creating multiple intermediate columns for the calculation.
Below is my full code
import pandas as pd
import numpy as np

data = pd.read_csv('eventdata.csv')
data.head()
data = data.sort_values(by=["SiteNumber", "SerialNumber", "ID", "Time"], ascending=True)
data['Count'] = data.groupby(data['EventDescription'].str.startswith('Pump RUN').cumsum()).cumcount().add(1)
data['EventsCount'] = data.groupby('ID')['Count'].apply(lambda x: x.shift(1))
data['EventsCount'] = data['EventsCount'].fillna(0)
data.drop(data[data['ID'].eq('')].index, inplace=True)
data['PumpRun'] = np.where(data['EventDescription'].str.startswith('Pump RUN'), 1, 0)
data['RunCount'] = data.groupby(['ID']).PumpRun.transform('sum')
data.to_csv("events_count.csv")
Below is the link for the data.
Upvotes: 0
Views: 102
Reputation: 3720
This may at least answer the first part of your question. The idea is to identify the Pump RUN groups with a cumsum on a boolean mask. Then groupby on that and the ID. From there, use transform to create the resulting EventCount column. Note that the count resets across ID boundaries.
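A minimal, self-contained illustration of the cumsum-grouping idea, using hypothetical toy data for a single ID:

```python
import pandas as pd

# Toy data (hypothetical) mimicking one ID's event stream.
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1],
    'EventDescription': ['Pump RUN', 'rate update', 'rate update',
                         'Pump STOP', 'Pump RUN', 'rate update'],
})

# Each True starts a new group; cumsum turns the mask into group labels.
mask = df['EventDescription'].str.startswith('Pump RUN')
grp = mask.cumsum()
print(grp.tolist())  # [1, 1, 1, 1, 2, 2] -- each Pump RUN bumps the label
```

Grouping by these labels (together with ID) then gives one group per Pump RUN segment.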
Updated to count the events after Pump Run and before the next stop
# Each target phrase starts a new run/stop segment; cumsum labels the segments.
target_event_phrases = 'Pump RUN|Pump STOP|User STOP|Infusion Complete|Bolus complete'
pump_grp = df['EventDescription'].str.contains(target_event_phrases).cumsum()

# Within each (ID, segment) group, put the count of trailing events on the
# first row if the segment starts with Pump RUN, otherwise zeros everywhere.
dfc = df.assign(EventCount=df.groupby(['ID', pump_grp])['EventDescription'].transform(
    lambda x: ([len(x) - 1] if 'Pump RUN' in x.iat[0] else [0]) + [0] * (len(x) - 1)
))
The following is used only to pull out and display sample Run/Stop groups
import random

g = random.choice(pump_grp.unique())
print(f'{g=}')
for i in range(g - 2, g + 3):
    print(dfc.loc[pump_grp == i, ['ID', 'EventDescription', 'EventCount']])
    print('--------------------------------------------')
Dataframe Sample Subset
g=126
ID EventDescription EventCount
469 5983688711 Pump RUN 3
470 5983688711 Pump rate updated:; - Rate 60.1 mL/hr 0
471 5983688711 Pump rate updated:; - Rate 50.1 mL/hr 0
472 5983688711 Pump rate updated:; - Rate 40.1 mL/hr 0
--------------------------------------------
ID EventDescription EventCount
473 5983688711 Pump STOP 0
474 5983688711 Downstream Occlusion! 0
475 5983688711 Downstream Occl Clr'd 0
--------------------------------------------
ID EventDescription EventCount
476 5983688711 Pump RUN - Auto-restart 3
477 5983688711 Pump rate updated:; - Rate 35.1 mL/hr 0
478 5983688711 Pump rate updated:; - Rate 30.1 mL/hr 0
479 5983688711 Pump rate updated:; - Rate 25.1 mL/hr 0
--------------------------------------------
ID EventDescription EventCount
480 5983688711 Infusion Complete 0
--------------------------------------------
ID EventDescription EventCount
481 5983688711 Pump RUN - KVO 1
482 5983688711 Pump rate updated:; - Rate 1 mL/hr 0
--------------------------------------------
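For a self-contained check, the same transform can be run on hypothetical toy data with a reduced phrase list:

```python
import pandas as pd

# Toy events for a single ID (hypothetical data).
df = pd.DataFrame({
    'ID': ['A'] * 7,
    'EventDescription': ['Pump RUN', 'rate 60', 'rate 50',
                         'Pump STOP', 'occlusion', 'Pump RUN', 'rate 35'],
})

# Segment labels: each RUN/STOP phrase starts a new group.
target = 'Pump RUN|Pump STOP'
pump_grp = df['EventDescription'].str.contains(target).cumsum()

# First row of each Pump RUN segment gets the count of trailing events; all else 0.
dfc = df.assign(EventCount=df.groupby(['ID', pump_grp])['EventDescription'].transform(
    lambda x: ([len(x) - 1] if 'Pump RUN' in x.iat[0] else [0]) + [0] * (len(x) - 1)
))
print(dfc['EventCount'].tolist())  # [2, 0, 0, 0, 0, 1, 0]
```

The two Pump RUN rows pick up the counts (2 and 1) of the non-RUN events that follow them before the next stop, matching the sample output above.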
Upvotes: 1