Reputation: 816
If I have a pandas DataFrame with a column of date times like this:
2018-01-16 01:51:10
2018-01-16 01:57:58
2018-01-16 01:57:58
2018-01-16 01:57:58
2018-01-16 01:57:58
2018-01-16 01:57:58
2018-01-16 01:57:59
2018-01-16 01:57:59
2018-01-16 01:57:59
2018-01-16 01:58:00
2018-01-16 01:58:00
2018-01-16 01:58:01
2018-01-16 01:58:01
2018-01-16 02:00:00
2018-01-16 02:45:01
2018-01-16 02:45:01
2018-01-16 02:45:02
2018-01-16 02:45:02
2018-01-16 02:45:02
2018-01-16 02:45:02
2018-01-16 02:45:03
2018-01-16 02:45:04
2018-01-16 02:45:04
2018-01-16 02:45:05
2018-01-16 02:45:06
2018-01-16 02:45:07
2018-01-16 02:45:07
2018-01-16 02:45:08
2018-01-16 02:58:55
How do I remove outlier times that are not part of a grouping, such that a time more than 1 second away from any group of times gets discarded? I want to get a pandas DataFrame like this:
2018-01-16 01:57:58
2018-01-16 01:57:58
2018-01-16 01:57:58
2018-01-16 01:57:58
2018-01-16 01:57:58
2018-01-16 01:57:59
2018-01-16 01:57:59
2018-01-16 01:57:59
2018-01-16 01:58:00
2018-01-16 01:58:00
2018-01-16 01:58:01
2018-01-16 01:58:01
2018-01-16 02:45:01
2018-01-16 02:45:01
2018-01-16 02:45:02
2018-01-16 02:45:02
2018-01-16 02:45:02
2018-01-16 02:45:02
2018-01-16 02:45:03
2018-01-16 02:45:04
2018-01-16 02:45:04
2018-01-16 02:45:05
2018-01-16 02:45:06
2018-01-16 02:45:07
2018-01-16 02:45:07
2018-01-16 02:45:08
Upvotes: 0
Views: 43
Reputation: 40908
Sample data:
import numpy as np
import pandas as pd
idx = [
"2018-01-16 01:51:10", "2018-01-16 01:57:58", "2018-01-16 01:57:58", "2018-01-16 01:57:58", "2018-01-16 01:57:58", "2018-01-16 01:57:58", "2018-01-16 01:57:59", "2018-01-16 01:57:59", "2018-01-16 01:57:59", "2018-01-16 01:58:00", "2018-01-16 01:58:00", "2018-01-16 01:58:01", "2018-01-16 01:58:01", "2018-01-16 02:00:00", "2018-01-16 02:45:01", "2018-01-16 02:45:01", "2018-01-16 02:45:02", "2018-01-16 02:45:02", "2018-01-16 02:45:02", "2018-01-16 02:45:02", "2018-01-16 02:45:03", "2018-01-16 02:45:04", "2018-01-16 02:45:04", "2018-01-16 02:45:05", "2018-01-16 02:45:06", "2018-01-16 02:45:07", "2018-01-16 02:45:07", "2018-01-16 02:45:08", "2018-01-16 02:58:55",
]
df = pd.DataFrame(range(len(idx)), index=idx, columns=["col"])
df.index = pd.to_datetime(df.index)
Take the first-differences in both directions:
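# bdiff[i]: gap in seconds between element i+1 and the element before it
bdiff = np.abs((df.index[1:] - df.index[:-1]).total_seconds())
# fdiff[i]: gap in seconds between element i and the element after it
fdiff = np.abs((df.index[:-1] - df.index[1:]).total_seconds())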
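Now, each of these is effectively one element too short; bdiff excludes the first element, while fdiff excludes the last. So, pad them with any value larger than the one-second threshold (99. here), so that the first and last elements are judged only by their single neighbor: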
bdiff = np.insert(bdiff, 0, 99.)
fdiff = np.append(fdiff, 99.)
Finally, form a boolean mask from this; the mask is True where a value is more than one second away from both of the values that surround it:
>>> mask = (bdiff > 1.0) & (fdiff > 1.0)
>>> print(df.loc[~mask])
col
2018-01-16 01:57:58 1
2018-01-16 01:57:58 2
2018-01-16 01:57:58 3
2018-01-16 01:57:58 4
2018-01-16 01:57:58 5
2018-01-16 01:57:59 6
2018-01-16 01:57:59 7
2018-01-16 01:57:59 8
2018-01-16 01:58:00 9
2018-01-16 01:58:00 10
2018-01-16 01:58:01 11
2018-01-16 01:58:01 12
2018-01-16 02:45:01 14
2018-01-16 02:45:01 15
2018-01-16 02:45:02 16
2018-01-16 02:45:02 17
2018-01-16 02:45:02 18
2018-01-16 02:45:02 19
2018-01-16 02:45:03 20
2018-01-16 02:45:04 21
2018-01-16 02:45:04 22
2018-01-16 02:45:05 23
2018-01-16 02:45:06 24
2018-01-16 02:45:07 25
2018-01-16 02:45:07 26
2018-01-16 02:45:08 27
>>> mask
array([ True, False, False, False, False, False, False, False, False,
False, False, False, False, True, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, True])
Upvotes: 1