Reputation: 25
I have a dataframe with data from an equipment. Sometimes the start of the equipment fail and will generate a line of data that the action was done. Then the equipment will try to start again a few seconds later and in most cases within 2 or 3 trials it succeeds.
The problem is that the retries and successes all goes to the same table with no distinction between false starts and real starts. As the equipment starts only once every few hours, all I have to do is find all the rows that have a similar timestamp (inside an interval of 2 minutes for example) and keep only the last one.
The task is to eliminate those "false starts" from the dataframe.
The dataframe is ordered by those timestamps so the index of those "false starts" will be a sequence. It can be done for one equipment by iterating and eliminating if:
df.timestamp_local.iloc[i]-df.timestamp_local.iloc[i-1] =< 'some timedelta'
But is impractical to do when running over thousands of equipments.
Input example of the dataframe where the last 3 ones are one case with only the last row as a "real start":
device_name timestamp_local tk_event_desc
0 A005 2019-08-29 19:14:57 Start
1 A005 2019-09-03 09:11:37 Start
2 A005 2019-09-06 14:06:30 Start
3 A005 2019-09-09 17:39:17 Start
4 A005 2019-09-12 10:43:33 Start
5 A005 2019-09-12 17:07:08 Start
6 A005 2019-09-13 01:18:36 Start
7 A005 2019-09-13 13:20:40 Start
8 A005 2019-09-17 17:54:44 Start
9 A005 2019-09-21 12:29:47 Start
10 A005 2019-09-22 11:58:26 Start
11 A005 2019-09-22 11:58:27 Start
12 A005 2019-09-22 11:58:29 Start
Upvotes: 1
Views: 639
Reputation: 634
Iterating a python loop over dataframes are highly discouraged. For why see https://stackoverflow.com/a/55557758/8479618. You should almost always use built in operations as they are optimized under-the-hood.
I'm going to assume your timestamps are already converted to datetimes. If not, use pandas.to_datetime()
.
Let df
be your above original dataframe. We first find the difference between each timestamp and shift it upwards.
df['time_diff'] = tf.timestamp_local.diff().shift(-1)
Then we check if the timedelta is below our threshold, and if it is we don't select it with our boolean mask (Note: I imported timedelta
with from datetime import timedelta
). In this example I set the threshold at 2 minutes, meaning if a start is within 2 minutes of another start I see it as a false start.
df[
(df['time_diff']>timedelta(minutes=2)) | #Checks for threshold
(df['time_diff'].isnull()) #handles the last null value
]
Upvotes: 0
Reputation: 28322
This could be done using groupby on the device column and a wanted time interval. For example, using 2 minutes:
# Make sure the column is datetime type.
df['timestamp_local'] = pd.to_datetime(df['timestamp_local'])
# Copy the time column to keep the actual timestamp values in the result.
df['time_group'] = df['timestamp_local']
freq = '2Min'
df.groupby(['device_name', pd.Grouper(key='time_group', freq=freq)]) \
.last() \
.reset_index() \
.drop(columns=['time_group'])
Result with the provided data:
device_name timestamp_local tk_event_desc
0 A005 2019-08-29 19:14:57 Start
1 A005 2019-09-03 09:11:37 Start
2 A005 2019-09-06 14:06:30 Start
3 A005 2019-09-09 17:39:17 Start
4 A005 2019-09-12 10:43:33 Start
5 A005 2019-09-12 17:07:08 Start
6 A005 2019-09-13 01:18:36 Start
7 A005 2019-09-13 13:20:40 Start
8 A005 2019-09-17 17:54:44 Start
9 A005 2019-09-21 12:29:47 Start
10 A005 2019-09-22 11:58:29 Start
Upvotes: 2