Reputation: 608
I'm trying to calculate running difference on the date column depending on "event column".
So, to add another column with date difference between 1
in event column (there only 0
and 1
).
Spo far I came to this half-working crappy solution
Dataframe:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0],'duration':None})
Code:
x = df.loc[df['event']==1, 'date']
k = 0
for i in range(len(x)):
df.loc[k:x.index[i], 'duration'] = x.iloc[i] - k
k = x.index[i]
But I'm sure there is a more elegant solution. Thanks for any advice.
Output format:
+------+-------+----------+
| date | event | duration |
+------+-------+----------+
| 1 | 0 | 3 |
| 2 | 0 | 3 |
| 3 | 1 | 3 |
| 4 | 0 | 6 |
| 5 | 0 | 6 |
| 6 | 0 | 6 |
| 7 | 0 | 6 |
| 8 | 0 | 6 |
| 9 | 1 | 6 |
| 10 | 0 | 4 |
| 11 | 0 | 4 |
| 12 | 0 | 4 |
| 13 | 1 | 4 |
| 14 | 0 | 2 |
| 15 | 1 | 2 |
+------+-------+----------+
Upvotes: 0
Views: 120
Reputation: 1012
Using your initial dataframe:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0],'duration':None})
Add an index-like column to mark where the transitions occur (you could also base this on the date
column if it is unique):
df = df.reset_index().rename(columns={'index':'idx'})
df.loc[df['event']==0, 'idx'] = np.nan
df['idx'] = df['idx'].fillna(method='bfill')
Then, use a groupby()
to count the records, and backfill them to match your structure:
df['duration'] = df.groupby('idx')['event'].count()
df['duration'] = df['duration'].fillna(method='bfill')
# Alternatively, the previous two lines can be combined as pointed out by OP
# df['duration'] = df.groupby('idx')['event'].transform('count')
df = df.drop(columns='idx')
print(df)
date event duration
0 1 0 2.0
1 2 1 2.0
2 3 0 3.0
3 4 0 3.0
4 5 1 3.0
5 6 0 5.0
6 7 0 5.0
7 8 0 5.0
8 9 0 5.0
9 10 1 5.0
10 11 0 6.0
11 12 0 6.0
12 13 0 6.0
13 14 0 6.0
14 15 0 6.0
15 16 1 6.0
16 17 0 NaN
It ends up as a float value because of the NaN in the last row. This approach works well in general if there are obvious "groups" of things to count.
As an alternative, because the dates are already there as integers you can look at the differences in dates directly:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0]})
tmp = df[df['event']==1].copy()
tmp['duration'] = (tmp['date'] - tmp['date'].shift(1)).fillna(tmp['date'])
df = pd.merge(df, tmp[['date','duration']], on='date', how='left').fillna(method='bfill')
Upvotes: 1