Selecting and eliminating rows of a dataframe that have timestamp too close to other rows

Question

I have a dataframe with data from an equipment. Sometimes the start of the equipment fail and will generate a line of data that the action was done. Then the equipment will try to start again a few seconds later and in most cases within 2 or 3 trials it succeeds.

The problem is that the retries and successes all goes to the same table with no distinction between false starts and real starts. As the equipment starts only once every few hours, all I have to do is find all the rows that have a similar timestamp (inside an interval of 2 minutes for example) and keep only the last one.

The task is to eliminate those "false starts" from the dataframe.

The dataframe is ordered by those timestamps so the index of those "false starts" will be a sequence. It can be done for one equipment by iterating and eliminating if:

df.timestamp_local.iloc[i]-df.timestamp_local.iloc[i-1] =< 'some timedelta'

But is impractical to do when running over thousands of equipments.

Input example of the dataframe where the last 3 ones are one case with only the last row as a "real start":

device_name timestamp_local tk_event_desc
0   A005    2019-08-29 19:14:57 Start
1   A005    2019-09-03 09:11:37 Start
2   A005    2019-09-06 14:06:30 Start
3   A005    2019-09-09 17:39:17 Start
4   A005    2019-09-12 10:43:33 Start
5   A005    2019-09-12 17:07:08 Start
6   A005    2019-09-13 01:18:36 Start
7   A005    2019-09-13 13:20:40 Start
8   A005    2019-09-17 17:54:44 Start
9   A005    2019-09-21 12:29:47 Start
10  A005    2019-09-22 11:58:26 Start
11  A005    2019-09-22 11:58:27 Start
12  A005    2019-09-22 11:58:29 Start

Shaido · Accepted Answer

This could be done using groupby on the device column and a wanted time interval. For example, using 2 minutes:

# Make sure the column is datetime type.
df['timestamp_local'] = pd.to_datetime(df['timestamp_local'])

# Copy the time column to  keep the actual timestamp values in the result.
df['time_group'] = df['timestamp_local']

freq = '2Min'
df.groupby(['device_name', pd.Grouper(key='time_group', freq=freq)]) \
  .last() \
  .reset_index() \
  .drop(columns=['time_group'])

Result with the provided data:

   device_name      timestamp_local   tk_event_desc
0         A005  2019-08-29 19:14:57           Start
1         A005  2019-09-03 09:11:37           Start
2         A005  2019-09-06 14:06:30           Start
3         A005  2019-09-09 17:39:17           Start
4         A005  2019-09-12 10:43:33           Start
5         A005  2019-09-12 17:07:08           Start
6         A005  2019-09-13 01:18:36           Start
7         A005  2019-09-13 13:20:40           Start
8         A005  2019-09-17 17:54:44           Start
9         A005  2019-09-21 12:29:47           Start
10        A005  2019-09-22 11:58:29           Start

Selecting and eliminating rows of a dataframe that have timestamp too close to other rows

Answers (2)

Related Questions