Reputation: 109
I am doing a classification problem in which I am trying to predict whether a car will be refuelled the following day.
The data consists of a date, an ID for every car, and a dummy variable indicating whether the car was refuelled on that specific date.
What I want to achieve is the column "days_since_refuelled". It should be the number of days since the last previous occurrence of refuelled==1, calculated separately for every car_id. If there has been no previous instance of refuelled==1, the value should be -1.
The desired output should look like this:
date        car_id  refuelled  days_since_refuelled
01-01-2019  1       0          -1
01-01-2019  2       1          -1
01-01-2019  3       1          -1
06-01-2019  1       0          -1
06-01-2019  2       0          5
06-01-2019  3       0          5
09-01-2019  1       1          -1
09-01-2019  2       0          8
09-01-2019  3       0          8
14-01-2019  1       0          5
14-01-2019  2       1          13
14-01-2019  3       0          13
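To pin down the rule, here is a naive reference sketch in plain Python (no pandas). The helper name days_since_refuelled and the rows list are made up for illustration, and it assumes the rows are already sorted by date. Note that a refuel on the current row does not count as its own "last occurrence".

```python
from datetime import date

# Sample rows copied from the table above: (date, car_id, refuelled)
rows = [
    (date(2019, 1, 1), 1, 0), (date(2019, 1, 1), 2, 1), (date(2019, 1, 1), 3, 1),
    (date(2019, 1, 6), 1, 0), (date(2019, 1, 6), 2, 0), (date(2019, 1, 6), 3, 0),
    (date(2019, 1, 9), 1, 1), (date(2019, 1, 9), 2, 0), (date(2019, 1, 9), 3, 0),
    (date(2019, 1, 14), 1, 0), (date(2019, 1, 14), 2, 1), (date(2019, 1, 14), 3, 0),
]


def days_since_refuelled(rows):
    """Hypothetical helper; assumes rows are sorted by date."""
    last_refuel = {}  # car_id -> date of the most recent earlier refuel
    result = []
    for day, car_id, refuelled in rows:
        previous = last_refuel.get(car_id)
        result.append(-1 if previous is None else (day - previous).days)
        # Record today's refuel only after computing the value, so the
        # current row never counts as its own previous refuel.
        if refuelled == 1:
            last_refuel[car_id] = day
    return result


expected = [-1, -1, -1, -1, 5, 5, -1, 8, 8, 5, 13, 13]
print(days_since_refuelled(rows) == expected)  # prints True
```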
Upvotes: 4
Views: 448
Reputation: 862921
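Use Series.where to keep the date only on rows where refuelled is 1 (every other row becomes a missing value), then apply Series.shift followed by ffill within each car_id group, so that every row carries the date of the previous refuel. Subtract that from the date column with Series.sub, convert the resulting timedeltas to whole days with Series.dt.days, and replace the remaining missing values with -1 using Series.fillna: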
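import pandas as pd

# convert to datetimes (the dates are day-first)
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# keep the date only on rows where the car was refuelled, NaT elsewhere
df['days_since_refuelled'] = df['date'].where(df['refuelled'].eq(1))
# per car: shift to skip the current row, forward-fill the last refuel date,
# then subtract it from the current date; transform keeps the original index
# (groupby.apply can prepend car_id to the index in newer pandas versions)
df['days_since_refuelled'] = (df['date'].sub(df.groupby('car_id')['days_since_refuelled']
                                               .transform(lambda x: x.shift().ffill()))
                                        .dt.days
                                        .fillna(-1)
                                        .astype(int))
print(df)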
         date  car_id  refuelled  days_since_refulled  days_since_refuelled
0  2019-01-01       1          0                   -1                    -1
1  2019-01-01       2          1                   -1                    -1
2  2019-01-01       3          1                   -1                    -1
3  2019-01-06       1          0                   -1                    -1
4  2019-01-06       2          0                    5                     5
5  2019-01-06       3          0                    5                     5
6  2019-01-09       1          1                   -1                    -1
7  2019-01-09       2          0                    8                     8
8  2019-01-09       3          0                    8                     8
9  2019-01-14       1          0                    5                     5
10 2019-01-14       2          1                   13                    13
11 2019-01-14       3          0                   13                    13
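For anyone who wants to reproduce this end to end, here is a self-contained sketch that rebuilds the sample data from the question and uses the built-in groupby shift and ffill methods instead of a lambda. The intermediate names refuel_date and last_refuel are only illustrative, and it assumes the rows are sorted by date within each car.

```python
import pandas as pd

# Sample data rebuilt from the question (day-first date strings).
df = pd.DataFrame({
    "date": ["01-01-2019"] * 3 + ["06-01-2019"] * 3
            + ["09-01-2019"] * 3 + ["14-01-2019"] * 3,
    "car_id": [1, 2, 3] * 4,
    "refuelled": [0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0],
})
df["date"] = pd.to_datetime(df["date"], dayfirst=True)

# Date of each refuel; NaT on rows without one.
refuel_date = df["date"].where(df["refuelled"].eq(1))

# Per car: shift so only earlier rows count, then carry the last refuel forward.
last_refuel = refuel_date.groupby(df["car_id"]).shift()
last_refuel = last_refuel.groupby(df["car_id"]).ffill()

# Difference in whole days; rows with no earlier refuel become -1.
df["days_since_refuelled"] = (
    (df["date"] - last_refuel).dt.days.fillna(-1).astype(int)
)
print(df)
```

Both groupby calls return a Series aligned to the original index, so the subtraction lines up row by row without any reindexing.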
Upvotes: 6