Reputation: 6009
Using Python 3.6 and Pandas 0.19.2
I have a DataFrame such as this one:
tid datetime event data
0 0 2017-03-22 10:59:59.864 START NaN
1 0 2017-03-22 10:59:59.931 END NaN
2 0 2017-03-22 10:59:59.935 START NaN
3 1 2017-03-22 10:59:59.939 END NaN
4 0 2017-03-22 10:59:59.940 END NaN
5 1 2017-03-22 10:59:59.941 START NaN
6 1 2017-03-22 10:59:59.945 END NaN
7 0 2017-03-22 10:59:59.947 START NaN
8 1 2017-03-22 10:59:59.955 START NaN
which contains start dates and end dates for transactions occurring inside threads (tid is the thread id). Sadly, the transactions themselves do not have a unique ID. So I need to group those rows by tid, order them by date, then take the lines two by two, so that each transaction gets one START and one END.
My current problem is that my initial dataframe may be missing the first START event for a thread (in my example above, the line with index 3 is an END event with no previous START). I need to remove those END lines, but I don't know how to do that.
The same problem applies to trailing START lines that have no matching END line.
Sample Input
import pandas as pd
import io
df = pd.read_csv(io.StringIO('''tid;datetime;event
0;2017-03-22 10:59:59.864;START
0;2017-03-22 10:59:59.931;END
0;2017-03-22 10:59:59.935;START
1;2017-03-22 10:59:59.939;END
0;2017-03-22 10:59:59.940;END
1;2017-03-22 10:59:59.941;START
1;2017-03-22 10:59:59.945;END
0;2017-03-22 10:59:59.947;START
1;2017-03-22 10:59:59.955;START'''), sep=';', parse_dates=['datetime'])
Expected output
Same dataframe, but with line #2 dropped, because it is the first line for tid 1 and is not a START event:
tid datetime event
0 0 2017-03-22 10:59:59.864 START
1 0 2017-03-22 10:59:59.931 END
3 1 2017-03-22 10:59:59.933 START
4 1 2017-03-22 10:59:59.945 END
5 0 2017-03-22 10:59:59.947 START
6 0 2017-03-22 10:59:59.955 END
BTW, bonus points if you end up with something like:
tid start_datetime stop_datetime
0 0 2017-03-22 10:59:59.864 2017-03-22 10:59:59.931
1 1 2017-03-22 10:59:59.933 2017-03-22 10:59:59.945
2 0 2017-03-22 10:59:59.947 2017-03-22 10:59:59.955
What I have tried
df.sort(['tid', 'datetime']).groupby('tid').first().event == 'END'
does not carry the original index from my dataframe, so I cannot use it to drop the lines (or, if it can, it is not obvious how).
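Presumably something based on head(1) could work instead, since it keeps the original index labels where first() discards them; an untested sketch:
# sketch: head(1) returns the first row of each group with its original index label
heads = df.sort_values(['tid', 'datetime']).groupby('tid').head(1)
bad = heads[heads.event == 'END'].index  # labels of leading END rows
df = df.drop(bad)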
Upvotes: 0
Views: 1131
Reputation: 6009
I managed to partially solve my problem this way:
import numpy as np

# order events by thread id and datetime
df = df.sort_values(['tid', 'datetime']).reset_index(drop=True)
# then group by tid
for tid, group in df.groupby('tid'):
    # for each group, drop the first line if it is an END event
    head = group.head(1).iloc[0]
    if head.event == 'END':
        df.drop(head.name, inplace=True)
    # and drop the last line if it is a START event
    tail = group.tail(1).iloc[0]
    if tail.event == 'START':
        df.drop(tail.name, inplace=True)
# take lines 2 by 2: each pair is one START and one END event that can be aggregated
df.groupby(np.arange(len(df)) // 2).agg({'tid': 'first', 'datetime': {'start': 'min', 'stop': 'max'}})
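Note that the nested-dict agg spec on the last line works on the pandas 0.19.2 stated in the question, but it was deprecated in pandas 0.20; a sketch of an equivalent spelling that avoids it:
# sketch: the same 2-by-2 aggregation without the nested-dict agg spec
g = df.groupby(np.arange(len(df)) // 2)
pairs = pd.DataFrame({'tid': g['tid'].first(),
                      'start_datetime': g['datetime'].min(),
                      'stop_datetime': g['datetime'].max()})
pairs = pairs[['tid', 'start_datetime', 'stop_datetime']]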
Upvotes: 0
Reputation: 21264
Here's another approach, with a groupby() strategy based on this answer:
import numpy as np

# boolean mask: True where a row repeats the previous row's event type
# (an unmatched entry that should be dropped)
def valid_event(x):
    if x.name:
        return df.loc[x.name - 1, 'event'] == x.event
    return False

mask = df.apply(valid_event, axis='columns')

# subset with mask, then group every 2 rows
df2 = df.loc[~mask]
df = (df2.groupby(np.arange(len(df2)) // 2)
         .agg({'tid': {'tid': 'first'},
               'datetime': {'start_datetime': 'min',
                            'stop_datetime': 'max'}}))
df.columns = df.columns.droplevel()  # drop multi-index cols
print(df)
tid start_datetime stop_datetime
0 0 2017-03-22 10:59:59.864 2017-03-22 10:59:59.931
1 1 2017-03-22 10:59:59.933 2017-03-22 10:59:59.945
2 0 2017-03-22 10:59:59.947 2017-03-22 10:59:59.955
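As an aside, the same mask can be built without a row-wise apply by comparing the event column against a shifted copy of itself; a minimal sketch:
# sketch: True where a row repeats the previous row's event type
mask = df['event'].eq(df['event'].shift())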
Upvotes: 1
Reputation: 862591
You can use shift + cumsum to create a unique Series for grouping, then use a custom function that selects by query and iat, and last reorder the columns with reindex_axis:
a = (df.tid != df.tid.shift()).cumsum()

def f(x):
    start = x.query('event == "START"')['datetime'].iat[0]
    end = x.query('event == "END"')['datetime'].iat[-1]
    tid = x.tid.iat[0]
    return pd.Series({'tid': tid, 'start_datetime': start, 'end_datetime': end})

print(df.groupby(a, as_index=False).apply(f)
        .reindex_axis(['tid', 'start_datetime', 'end_datetime'], 1))
tid start_datetime end_datetime
0 0 2017-03-22 10:59:59.864000 2017-03-22 10:59:59.931000
1 1 2017-03-22 10:59:59.933000 2017-03-22 10:59:59.945000
2 0 2017-03-22 10:59:59.947000 2017-03-22 10:59:59.955000
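To see what the grouper holds, here are the labels it assigns, assuming the cleaned six-row frame from the question's expected output; consecutive rows with the same tid share a label:
# sketch: inspect the group labels; consecutive runs of the same tid share one
# tid: 0 0 1 1 0 0  ->  [1, 1, 2, 2, 3, 3]
print(a.tolist())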
Another solution with boolean indexing instead of query (maybe faster; query does better on larger DataFrames):
a = (df.tid != df.tid.shift()).cumsum()

def f(x):
    # note: mask on the group x itself, not on the full df
    start = x.loc[x.event == "START", 'datetime'].iat[0]
    end = x.loc[x.event == "END", 'datetime'].iat[-1]
    tid = x.tid.iat[0]
    return pd.Series({'tid': tid, 'start_datetime': start, 'end_datetime': end})

print(df.groupby(a, as_index=False).apply(f)
        .reindex_axis(['tid', 'start_datetime', 'end_datetime'], 1))
tid start_datetime end_datetime
0 0 2017-03-22 10:59:59.864000 2017-03-22 10:59:59.931000
1 1 2017-03-22 10:59:59.933000 2017-03-22 10:59:59.945000
2 0 2017-03-22 10:59:59.947000 2017-03-22 10:59:59.955000
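Note that iat[0] raises an IndexError when a run contains no START at all (as can happen with the raw nine-row sample input before any cleanup); a hedged sketch of a guarded variant that skips such runs instead:
# sketch: return None for incomplete runs; groupby.apply drops those groups
def f(x):
    starts = x.loc[x.event == 'START', 'datetime']
    ends = x.loc[x.event == 'END', 'datetime']
    if starts.empty or ends.empty:
        return None
    return pd.Series({'tid': x.tid.iat[0],
                      'start_datetime': starts.iat[0],
                      'end_datetime': ends.iat[-1]})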
Upvotes: 1
Reputation: 153460
One approach is the following. (The custom function could be tidied up to handle more diverse inputs, but it works for the sample input.)
df = df.assign(group=(df.tid.diff().fillna(0) != 0).cumsum())

def myTwo(x):
    starttime = x.query('event == "START"')['datetime'].min()
    endtime = x.query('event == "END"')['datetime'].max()
    tid = x.tid.max()
    return pd.Series({'tid': tid, 'start_datetime': starttime, 'end_datetime': endtime})

print(df.groupby('group').apply(myTwo)[['tid', 'start_datetime', 'end_datetime']])
Output:
tid start_datetime end_datetime
group
0 0 2017-03-22 10:59:59.864000 2017-03-22 10:59:59.931000
1 1 2017-03-22 10:59:59.933000 2017-03-22 10:59:59.945000
2 0 2017-03-22 10:59:59.947000 2017-03-22 10:59:59.955000
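To match the bonus layout exactly (a plain 0..2 index instead of the group level), a reset_index can be tacked on; a small sketch:
# sketch: drop the 'group' index level to get a plain integer index
out = (df.groupby('group').apply(myTwo)
         [['tid', 'start_datetime', 'end_datetime']]
         .reset_index(drop=True))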
Upvotes: 1