Adarsh Chavakula
Adarsh Chavakula

Reputation: 1589

Python 3.6: Pandas - Obtain earlier occurence of an entry along column of a DataFrame

I have a pandas data frame with two columns - "Day" and "Event". On each day, an arbitrary number of events can occur, each denoted by a char. I want to construct an additional column indicating how many days have passed since an event last occurred. The code below can illustrate this:

>>> import pandas as pd
>>> event_data = pd.DataFrame({'Day':[1,1,1,1,2,2,2,3,3,4,4,4,4], 
                             'Event':['A','B','C','D','B','F','A','B','C','G','A','B','D']})
>>> print(event_data)

     Day Event
0     1     A
1     1     B
2     1     C
3     1     D
4     2     B
5     2     F
6     2     A
7     3     B
8     3     C
9     4     G
10    4     A
11    4     B
12    4     D

I want an additional column "days_since_event" which would look like:

    Day Event  Days_since_event
0     1     A                -1
1     1     B                -1
2     1     C                -1
3     1     D                -1
4     2     B                 1
5     2     F                -1
6     2     A                 1
7     3     B                 1
8     3     C                 2
9     4     G                -1
10    4     A                 2
11    4     B                 1
12    4     D                 3

Row 12 has a value 3 as event 'D' occurred on day 4 and the previous occurrence was on Day 1 implying that 3 days have passed since event 'D' last happened. Rows with value -1 indicate that the event is happening for the first time on that day.

What is a computationally optimum way of doing this? The original dataset has over 10 million rows and using loops or repeated data subsetting is too slow.

Upvotes: 0

Views: 65

Answers (1)

gereleth
gereleth

Reputation: 2482

You could group the dataframe by Event column and calculate a diff of Day like this:

event_data['Days_since_event'] = event_data.groupby('Event')['Day']\
                                           .diff().fillna(-1).astype(int)

First occurrences of every event will have NaN values, so I fill those with -1 as described in your question.

Upvotes: 3

Related Questions