Reputation: 479
I have a dataframe with multiple ids and I want to slice it by a certain sliding window frame and compute duration of each id that appears in that window. Certain time slices have only a single id and others have multiple ids.
For instances where multiple ids occur, I can capture the duration for each id as below.
Dataframe with multiple ids
id,date,value
1,2012-01-01 00:09:45,1
1,2012-01-01 00:09:50,1
2,2012-01-01 00:09:55,1
2,2012-01-01 00:10:00,1
2,2012-01-01 00:30:10,1
2,2012-01-01 00:30:15,1
3,2012-01-01 00:30:20,1
3,2012-01-01 00:30:25,1
3,2012-01-01 00:30:30,1
1,2012-01-01 00:30:45,1
import pandas as pd
df = pd.read_csv('df.csv')
df['date'] = pd.to_datetime(df['date'])
diff_ids = df['id'] != df['id'].shift(1)
df = df[diff_ids].copy()
df['start'] = df['date']
df['end'] = df['date'].shift(-1)
df['duration'] = df['end'] - df['start']
print(df)
Output
id date value start end duration
1 2012-01-01 00:09:45 1 2012-01-01 00:09:45 2012-01-01 00:09:55 00:00:10
2 2012-01-01 00:09:55 1 2012-01-01 00:09:55 2012-01-01 00:30:20 00:20:25
3 2012-01-01 00:30:20 1 2012-01-01 00:30:20 2012-01-01 00:30:45 00:00:25
1 2012-01-01 00:30:45 1 2012-01-01 00:30:45 NaT NaT
In the same logic above, how can I also cater for the below case where only a single id appears
Dataframe with single id
id,date,value
2,2012-01-01 00:09:45,1
2,2012-01-01 00:09:50,1
2,2012-01-01 00:09:55,1
2,2012-01-01 00:10:00,1
2,2012-01-01 00:30:10,1
2,2012-01-01 00:30:15,1
2,2012-01-01 00:30:20,1
2,2012-01-01 00:30:25,1
2,2012-01-01 00:30:30,1
2,2012-01-01 00:30:45,1
Expected output:
id date value start end duration
2 2012-01-01 00:09:45 1 2012-01-01 00:09:45 2012-01-01 00:30:45 00:21:10
Upvotes: 0
Views: 57
Reputation: 23166
If there is just one single ID, you could just do this:
>>> df.sort_values("date").head(1).assign(start=df["date"].min(), end= df["date"].max(), duration=df["date"].max()-df["date"].min())
id date value start end duration
2 2012-01-01 00:09:45 1 2012-01-01 00:09:45 2012-01-01 00:30:45 0 days 00:21:00
Upvotes: 1