GKC
GKC

Reputation: 479

Compute duration for multiple scenarios in pandas

I have a dataframe with multiple ids and I want to slice it by a certain sliding window frame and compute duration of each id that appears in that window. Certain time slices have only a single id and others have multiple ids.

For instances where multiple ids occur, I can capture the duration for each id as below.

Dataframe with multiple ids

id,date,value
1,2012-01-01 00:09:45,1
1,2012-01-01 00:09:50,1
2,2012-01-01 00:09:55,1
2,2012-01-01 00:10:00,1
2,2012-01-01 00:30:10,1
2,2012-01-01 00:30:15,1
3,2012-01-01 00:30:20,1
3,2012-01-01 00:30:25,1
3,2012-01-01 00:30:30,1
1,2012-01-01 00:30:45,1


import pandas as pd
df = pd.read_csv('df.csv')

df['date'] = pd.to_datetime(df['date'])
diff_ids = df['id'] != df['id'].shift(1)
df = df[diff_ids].copy()
df['start'] = df['date']
df['end'] = df['date'].shift(-1)
df['duration'] = df['end'] - df['start']
print(df)

Output

id date                 value  start                 end                  duration
1  2012-01-01 00:09:45  1      2012-01-01 00:09:45   2012-01-01 00:09:55  00:00:10
2  2012-01-01 00:09:55  1      2012-01-01 00:09:55   2012-01-01 00:30:20  00:20:25
3  2012-01-01 00:30:20  1      2012-01-01 00:30:20   2012-01-01 00:30:45  00:00:25
1  2012-01-01 00:30:45  1      2012-01-01 00:30:45   NaT                  NaT

In the same logic above, how can I also cater for the below case where only a single id appears

Dataframe with single id

id,date,value
2,2012-01-01 00:09:45,1
2,2012-01-01 00:09:50,1
2,2012-01-01 00:09:55,1
2,2012-01-01 00:10:00,1
2,2012-01-01 00:30:10,1
2,2012-01-01 00:30:15,1
2,2012-01-01 00:30:20,1
2,2012-01-01 00:30:25,1
2,2012-01-01 00:30:30,1
2,2012-01-01 00:30:45,1

Expected output:

id date                 value  start                 end                  duration
2  2012-01-01 00:09:45  1      2012-01-01 00:09:45   2012-01-01 00:30:45  00:21:10

Upvotes: 0

Views: 57

Answers (1)

not_speshal
not_speshal

Reputation: 23166

If there is just one single ID, you could just do this:

>>> df.sort_values("date").head(1).assign(start=df["date"].min(), end= df["date"].max(), duration=df["date"].max()-df["date"].min())
id date                 value  start                 end                  duration
2  2012-01-01 00:09:45  1      2012-01-01 00:09:45   2012-01-01 00:30:45  0 days 00:21:00

Upvotes: 1

Related Questions