Guillaume

Reputation: 6009

Python pandas: conditionally delete first row of each group

Using Python 3.6 and Pandas 0.19.2

I have a DataFrame such as this one:

   tid                datetime  event  data
0    0 2017-03-22 10:59:59.864  START   NaN
1    0 2017-03-22 10:59:59.931    END   NaN
2    0 2017-03-22 10:59:59.935  START   NaN
3    1 2017-03-22 10:59:59.939    END   NaN
4    0 2017-03-22 10:59:59.940    END   NaN
5    1 2017-03-22 10:59:59.941  START   NaN
6    1 2017-03-22 10:59:59.945    END   NaN
7    0 2017-03-22 10:59:59.947  START   NaN
8    1 2017-03-22 10:59:59.955  START   NaN

which contains start and end dates for transactions occurring inside threads (tid is the thread id). Sadly, the transactions themselves do not have a unique ID, so I need to group the rows by tid, order them by date, then take them two by two in order to get one START and one END for each transaction.

My current problem is that my initial DataFrame may be missing the first START event for a thread (in the example above, the row with index 3 is an END event with no previous START). I need to remove those END rows, but I don't know how to do that.

The same problem occurs for trailing START rows that have no matching END row.

Sample Input

import pandas as pd
import io
df = pd.read_csv(io.StringIO('''tid;datetime;event
0;2017-03-22 10:59:59.864;START
0;2017-03-22 10:59:59.931;END
0;2017-03-22 10:59:59.935;START
1;2017-03-22 10:59:59.939;END
0;2017-03-22 10:59:59.940;END
1;2017-03-22 10:59:59.941;START
1;2017-03-22 10:59:59.945;END
0;2017-03-22 10:59:59.947;START
1;2017-03-22 10:59:59.955;START'''), sep=';', parse_dates=['datetime'])

Expected output

Same DataFrame but with the row at index 3 dropped, because it is the first row for tid 1 and is not a START event, and with the trailing unmatched START rows (indices 7 and 8) dropped as well:

   tid                datetime  event
0    0 2017-03-22 10:59:59.864  START
1    0 2017-03-22 10:59:59.931    END
2    0 2017-03-22 10:59:59.935  START
4    0 2017-03-22 10:59:59.940    END
5    1 2017-03-22 10:59:59.941  START
6    1 2017-03-22 10:59:59.945    END

BTW, bonus points if you end up with something like:

   tid          start_datetime           stop_datetime
0    0 2017-03-22 10:59:59.864 2017-03-22 10:59:59.931
1    0 2017-03-22 10:59:59.935 2017-03-22 10:59:59.940
2    1 2017-03-22 10:59:59.941 2017-03-22 10:59:59.945

What I have tried

df.sort_values(['tid', 'datetime']).groupby('tid').first().event == 'END' does not carry the original index of my DataFrame, so I cannot use it to drop the rows (or, if I can, it is not obvious how to do that).
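
For reference, a minimal sketch of one way around that, assuming the sample df above: groupby().head(1) returns the first row of each group with the original index preserved, so the boolean test can be turned into row labels to drop. The same idea with tail(1) and event == 'START' handles the trailing unmatched rows.

import pandas as pd

# first row of each tid group, with the original index preserved
first_rows = df.sort_values(['tid', 'datetime']).groupby('tid').head(1)

# labels of first rows that are END events, dropped from the frame
to_drop = first_rows.index[first_rows['event'] == 'END']
df = df.drop(to_drop)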

Upvotes: 0

Views: 1131

Answers (4)

Guillaume

Reputation: 6009

I managed to partially solve my problem this way:

import numpy as np

# order events by thread id and datetime
df = df.sort_values(['tid', 'datetime']).reset_index(drop=True)
# then group by tid
for tid, group in df.groupby('tid'):
    # for each group, drop the first row if it is an END event
    head = group.head(1).iloc[0]
    if head.event == 'END':
        df.drop(head.name, inplace=True)
    # and drop the last row if it is a START event
    tail = group.tail(1).iloc[0]
    if tail.event == 'START':
        df.drop(tail.name, inplace=True)

# take rows 2 by 2: each pair is one START and one END event that can be aggregated
df.groupby(np.arange(len(df)) // 2).agg(
    {'tid': 'first', 'datetime': {'start': 'min', 'stop': 'max'}})
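
The explicit loop can also be replaced with a vectorized drop; a sketch under the same assumptions (frame sorted by tid and datetime with a fresh index), using cumcount to spot the first and last row of each group:

g = df.groupby('tid')
# first row of a group that is an END, or last row of a group that is a START
first_bad = (g.cumcount() == 0) & (df['event'] == 'END')
last_bad = (g.cumcount(ascending=False) == 0) & (df['event'] == 'START')
df = df[~(first_bad | last_bad)]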

Upvotes: 0

andrew_reece

Reputation: 21264

Here's another approach, with a groupby() strategy based on this answer:

import numpy as np

# boolean mask: True where a row repeats the previous row's event
# (assumes the frame is already ordered so START/END pairs are consecutive)
def valid_event(x):
    if x.name:
        return df.loc[x.name-1,'event']==x.event
    return False
mask = df.apply(valid_event, axis='columns')

# subset with mask
df2 = df.loc[~mask]
df = (df2.groupby(np.arange(len(df2))//2) # groupby every 2 rows
         .agg({'tid':{'tid':'first'},
               'datetime':{'start_datetime':'min',
                           'stop_datetime':'max'}
              })
      )

df.columns = df.columns.droplevel() # drop multi-index cols

print(df)

   tid          start_datetime           stop_datetime
0    0 2017-03-22 10:59:59.864 2017-03-22 10:59:59.931
1    1 2017-03-22 10:59:59.933 2017-03-22 10:59:59.945
2    0 2017-03-22 10:59:59.947 2017-03-22 10:59:59.955
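
The same mask can be written without apply, using shift; an equivalent sketch (which also works with a non-integer index), shown on a tiny hypothetical frame:

import pandas as pd

# the mask flags rows whose event repeats the previous row's event
events = pd.DataFrame({'event': ['END', 'START', 'END', 'END']})
mask = events['event'].eq(events['event'].shift())
print(mask.tolist())  # [False, False, False, True]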

Upvotes: 1

jezrael

Reputation: 862591

You can use shift + cumsum to create a unique Series for grouping, then use a custom function that selects rows with query and iat; last, reorder the columns with reindex_axis:

a = (df.tid != df.tid.shift()).cumsum()

def f(x):
    start = x.query('event == "START"')['datetime'].iat[0]
    end = x.query('event == "END"')['datetime'].iat[-1]
    tid = x.tid.iat[0]
    return pd.Series({'tid':tid,'start_datetime':start,'end_datetime':end})

print(df.groupby(a, as_index=False).apply(f)
        .reindex_axis(['tid','start_datetime','end_datetime'], 1))

   tid              start_datetime                end_datetime
0    0  2017-03-22 10:59:59.864000  2017-03-22 10:59:59.931000
1    1  2017-03-22 10:59:59.933000  2017-03-22 10:59:59.945000
2    0  2017-03-22 10:59:59.947000  2017-03-22 10:59:59.955000
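
To see what the grouping Series does: each run of consecutive identical tid values gets its own group number. A small illustration with made-up tid values:

import pandas as pd

tid = pd.Series([0, 0, 1, 1, 0, 0])
a = (tid != tid.shift()).cumsum()
print(a.tolist())  # [1, 1, 2, 2, 3, 3] -> one group per consecutive run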

Another solution with boolean indexing instead of query (it may be faster here, although query is better on larger DataFrames):

a = (df.tid != df.tid.shift()).cumsum()

def f(x):
    # select within the group (x), not in the whole frame
    start = x.loc[x.event == "START", 'datetime'].iat[0]
    end = x.loc[x.event == "END", 'datetime'].iat[-1]

    tid = x.tid.iat[0]
    return pd.Series({'tid':tid,'start_datetime':start,'end_datetime':end})

print(df.groupby(a, as_index=False).apply(f)
        .reindex_axis(['tid','start_datetime','end_datetime'], 1))

   tid              start_datetime                end_datetime
0    0  2017-03-22 10:59:59.864000  2017-03-22 10:59:59.931000
1    1  2017-03-22 10:59:59.933000  2017-03-22 10:59:59.945000
2    0  2017-03-22 10:59:59.947000  2017-03-22 10:59:59.955000

Upvotes: 1

Scott Boston

Reputation: 153460

One approach (the custom function could be tidied up to handle more diverse inputs, but it works for the sample input):

df = df.assign(group=(df.tid.diff().fillna(0) != 0).cumsum())

def myTwo(x):
    starttime = x.query('event == "START"')['datetime'].min()
    endtime = x.query('event == "END"')['datetime'].max()
    tid = x.tid.max()
    return pd.Series({'tid':tid,'start_datetime':starttime,'end_datetime':endtime})

print(df.groupby('group').apply(myTwo)[['tid','start_datetime','end_datetime']])

Output:

       tid              start_datetime                end_datetime
group                                                             
0        0  2017-03-22 10:59:59.864000  2017-03-22 10:59:59.931000
1        1  2017-03-22 10:59:59.933000  2017-03-22 10:59:59.945000
2        0  2017-03-22 10:59:59.947000  2017-03-22 10:59:59.955000
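
If a flat 0..n index is preferred over the group level in the result, reset_index(drop=True) can be appended; a small usage note, not part of the original answer:

out = (df.groupby('group').apply(myTwo)
         [['tid', 'start_datetime', 'end_datetime']]
         .reset_index(drop=True))
print(out)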

Upvotes: 1
