Reputation: 1589
I have a pandas data frame with two columns - "Day" and "Event". On each day, an arbitrary number of events can occur, each denoted by a char. I want to construct an additional column indicating how many days have passed since an event last occurred. The code below can illustrate this:
>>> import pandas as pd
>>> event_data = pd.DataFrame({'Day':[1,1,1,1,2,2,2,3,3,4,4,4,4],
'Event':['A','B','C','D','B','F','A','B','C','G','A','B','D']})
>>> print(event_data)
Day Event
0 1 A
1 1 B
2 1 C
3 1 D
4 2 B
5 2 F
6 2 A
7 3 B
8 3 C
9 4 G
10 4 A
11 4 B
12 4 D
I want an additional column "days_since_event" which would look like:
Day Event Days_since_event
0 1 A -1
1 1 B -1
2 1 C -1
3 1 D -1
4 2 B 1
5 2 F -1
6 2 A 1
7 3 B 1
8 3 C 2
9 4 G -1
10 4 A 2
11 4 B 1
12 4 D 3
Row 12 has a value 3 as event 'D' occurred on day 4 and the previous occurrence was on Day 1 implying that 3 days have passed since event 'D' last happened. Rows with value -1 indicate that the event is happening for the first time on that day.
What is a computationally optimum way of doing this? The original dataset has over 10 million rows and using loops or repeated data subsetting is too slow.
Upvotes: 0
Views: 65
Reputation: 2482
You could group the dataframe by Event
column and calculate a diff
of Day
like this:
event_data['Days_since_event'] = event_data.groupby('Event')['Day']\
.diff().fillna(-1).astype(int)
First occurrences of every event will have NaN
values, so I fill those with -1 as described in your question.
Upvotes: 3