Alex_Y
Alex_Y

Reputation: 608

Dataframe: calculate difference in dates column by another column

I'm trying to calculate running difference on the date column depending on "event column". So, to add another column with date difference between 1 in event column (there only 0 and 1).

Spo far I came to this half-working crappy solution

Dataframe:

    df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0],'duration':None})

Code:

x = df.loc[df['event']==1, 'date']
k = 0
for i in range(len(x)):
    df.loc[k:x.index[i], 'duration'] = x.iloc[i] - k
    k = x.index[i]

But I'm sure there is a more elegant solution. Thanks for any advice.

Output format:

+------+-------+----------+
| date | event | duration |
+------+-------+----------+
|    1 |     0 |        3 |
|    2 |     0 |        3 |
|    3 |     1 |        3 |
|    4 |     0 |        6 |
|    5 |     0 |        6 |
|    6 |     0 |        6 |
|    7 |     0 |        6 |
|    8 |     0 |        6 |
|    9 |     1 |        6 |
|   10 |     0 |        4 |
|   11 |     0 |        4 |
|   12 |     0 |        4 |
|   13 |     1 |        4 |
|   14 |     0 |        2 |
|   15 |     1 |        2 |
+------+-------+----------+

Upvotes: 0

Views: 120

Answers (1)

Rick M
Rick M

Reputation: 1012

Using your initial dataframe:

df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0],'duration':None})

Add an index-like column to mark where the transitions occur (you could also base this on the date column if it is unique):

df = df.reset_index().rename(columns={'index':'idx'})
df.loc[df['event']==0, 'idx'] = np.nan
df['idx'] = df['idx'].fillna(method='bfill')

Then, use a groupby() to count the records, and backfill them to match your structure:

df['duration'] = df.groupby('idx')['event'].count() 
df['duration'] = df['duration'].fillna(method='bfill') 

# Alternatively, the previous two lines can be combined as pointed out by OP
# df['duration'] = df.groupby('idx')['event'].transform('count')

df = df.drop(columns='idx')

print(df)
    date  event  duration
0      1      0       2.0
1      2      1       2.0
2      3      0       3.0
3      4      0       3.0
4      5      1       3.0
5      6      0       5.0
6      7      0       5.0
7      8      0       5.0
8      9      0       5.0
9     10      1       5.0
10    11      0       6.0
11    12      0       6.0
12    13      0       6.0
13    14      0       6.0
14    15      0       6.0
15    16      1       6.0
16    17      0       NaN

It ends up as a float value because of the NaN in the last row. This approach works well in general if there are obvious "groups" of things to count.

As an alternative, because the dates are already there as integers you can look at the differences in dates directly:

df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0]})

tmp = df[df['event']==1].copy()
tmp['duration'] = (tmp['date'] - tmp['date'].shift(1)).fillna(tmp['date'])
df = pd.merge(df, tmp[['date','duration']], on='date', how='left').fillna(method='bfill')

Upvotes: 1

Related Questions