Reputation: 414
I have a dataset with the following index values and one datetime variable:
1 2017-01-03 09:30:01.958
46 2017-01-03 09:30:47.879
99 2017-01-03 09:33:48.121
117 2017-01-03 09:47:06.215
139 2017-01-03 09:51:06.054
1567 2017-01-03 14:17:18.949
2480 2017-01-03 15:57:13.442
2481 2017-01-03 15:57:14.333
2486 2017-01-03 15:57:37.500
2487 2017-01-03 15:57:38.677
2489 2017-01-03 15:57:41.053
2491 2017-01-03 15:57:54.870
2498 2017-01-03 15:59:24.210
What I am trying to do is remove rows whose index value directly follows the previous one, retaining only the first observation of each such segment; in this case the code should drop the rows with indexes 2481 and 2487. I tried using
df[df.index.diff() == 0].drop()
but it only returned
AttributeError: 'Int64Index' object has no attribute 'diff'
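For reference, a minimal sketch that rebuilds the sample data (the column name date is only an assumption, matching the output shown in the answers below):

import pandas as pd

# index values and timestamps copied from the sample above
idx = [1, 46, 99, 117, 139, 1567, 2480, 2481, 2486, 2487, 2489, 2491, 2498]
dates = pd.to_datetime(['2017-01-03 09:30:01.958', '2017-01-03 09:30:47.879',
                        '2017-01-03 09:33:48.121', '2017-01-03 09:47:06.215',
                        '2017-01-03 09:51:06.054', '2017-01-03 14:17:18.949',
                        '2017-01-03 15:57:13.442', '2017-01-03 15:57:14.333',
                        '2017-01-03 15:57:37.500', '2017-01-03 15:57:38.677',
                        '2017-01-03 15:57:41.053', '2017-01-03 15:57:54.870',
                        '2017-01-03 15:59:24.210'])
df = pd.DataFrame({'date': dates}, index=idx)

# df.index.diff() raises the AttributeError above on pandas versions where
# Int64Index does not implement diff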
Upvotes: 1
Views: 242
Reputation: 294488
More numpy:
df[np.append(True, df.index.values[1:] != df.index.values[:-1] + 1)]
date
1 2017-01-03 09:30:01.958
46 2017-01-03 09:30:47.879
99 2017-01-03 09:33:48.121
117 2017-01-03 09:47:06.215
139 2017-01-03 09:51:06.054
1567 2017-01-03 14:17:18.949
2480 2017-01-03 15:57:13.442
2486 2017-01-03 15:57:37.500
2489 2017-01-03 15:57:41.053
2491 2017-01-03 15:57:54.870
2498 2017-01-03 15:59:24.210
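Roughly, the mask keeps a row whenever its index value is not the previous index value plus one; a sketch with illustrative intermediate names:

import numpy as np

idx = df.index.values
# True wherever a row does not directly continue the previous index value
breaks = idx[1:] != idx[:-1] + 1
# the first row has no predecessor, so it is always kept
mask = np.append(True, breaks)
df[mask]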
Upvotes: 1
Reputation: 863166
You can use boolean indexing; since diff is not implemented on the index itself, convert the index with to_series first:
df = df[df.index.to_series().diff() != 1]
print (df)
date
1 2017-01-03 09:30:01.958
46 2017-01-03 09:30:47.879
99 2017-01-03 09:33:48.121
117 2017-01-03 09:47:06.215
139 2017-01-03 09:51:06.054
1567 2017-01-03 14:17:18.949
2480 2017-01-03 15:57:13.442
2486 2017-01-03 15:57:37.500
2489 2017-01-03 15:57:41.053
2491 2017-01-03 15:57:54.870
2498 2017-01-03 15:59:24.210
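For illustration, the intermediate Series behind that mask (computed on the original df): diff is NaN for the first row and exactly 1 for rows that continue the previous index, here 2481 and 2487, so keeping != 1 drops only those:

s = df.index.to_series().diff()
# NaN != 1 evaluates to True, so the first row survives;
# rows 2481 and 2487 give 1.0 and are dropped
df[s != 1]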
Thank you piRSquared for the numpy alternative:
df[np.append(0, np.diff(df.index.values)) != 1]
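Here np.diff fills the same role as to_series().diff(): a 0 is prepended with np.append so the first row never equals 1 and is always kept.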
Timings:
df = pd.concat([df]*1000)
#[11000 rows x 1 columns]
In [60]: %timeit [True] + [(i[0]+1) != i[1] for i in zip(df.index.tolist(), df.index.tolist()[1:])]
100 loops, best of 3: 4.19 ms per loop
In [61]: %timeit np.append(0, np.diff(df.index.values)) != 1
The slowest run took 4.72 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 33.1 µs per loop
In [62]: %timeit df.index.to_series().diff() != 1
1000 loops, best of 3: 260 µs per loop
Upvotes: 3