Adrian Y
Adrian Y

Reputation: 414

Filtering out consecutive rows from dataset

I have a dataset with the index values and one datetime variable as follows:

1      2017-01-03 09:30:01.958
46     2017-01-03 09:30:47.879
99     2017-01-03 09:33:48.121
117    2017-01-03 09:47:06.215
139    2017-01-03 09:51:06.054
1567   2017-01-03 14:17:18.949
2480   2017-01-03 15:57:13.442
2481   2017-01-03 15:57:14.333
2486   2017-01-03 15:57:37.500
2487   2017-01-03 15:57:38.677
2489   2017-01-03 15:57:41.053
2491   2017-01-03 15:57:54.870
2498   2017-01-03 15:59:24.210

What I am trying to do is to remove consecutive rows from the data (only retaining the first observation from the segment), in this case the code should drop the rows with index 2481 and 2487. I tried using

df[df.index.diff() == 0].drop()

but it only returned

AttributeError: 'Int64Index' object has no attribute 'diff'

Upvotes: 1

Views: 242

Answers (2)

piRSquared
piRSquared

Reputation: 294488

more numpy

df[np.append(True, df.index.values[1:] != df.index.values[:-1] + 1)]

                        date
1    2017-01-03 09:30:01.958
46   2017-01-03 09:30:47.879
99   2017-01-03 09:33:48.121
117  2017-01-03 09:47:06.215
139  2017-01-03 09:51:06.054
1567 2017-01-03 14:17:18.949
2480 2017-01-03 15:57:13.442
2486 2017-01-03 15:57:37.500
2489 2017-01-03 15:57:41.053
2491 2017-01-03 15:57:54.870
2498 2017-01-03 15:59:24.210

Upvotes: 1

jezrael
jezrael

Reputation: 863166

You can use boolean indexing, for working with index with not implemented method use to_series:

df = df[df.index.to_series().diff() != 1]
print (df)
                         date
1     2017-01-03 09:30:01.958
46    2017-01-03 09:30:47.879
99    2017-01-03 09:33:48.121
117   2017-01-03 09:47:06.215
139   2017-01-03 09:51:06.054
1567  2017-01-03 14:17:18.949
2480  2017-01-03 15:57:13.442
2486  2017-01-03 15:57:37.500
2489  2017-01-03 15:57:41.053
2491  2017-01-03 15:57:54.870
2498  2017-01-03 15:59:24.210

Thank you piRSquared for numpy alternative:

df[np.append(0, np.diff(df.index.values)) != 1]

Timings:

#[11000 rows x 1 columns]
df = pd.concat([df]*1000)

In [60]: %timeit [True] + [(i[0]+1) != i[1] for i in zip(df.index.tolist(), df.index.tolist()[1:])]
100 loops, best of 3: 4.19 ms per loop

In [61]: %timeit np.append(0, np.diff(df.index.values)) != 1
The slowest run took 4.72 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 33.1 µs per loop

In [62]: %timeit df.index.to_series().diff() != 1
1000 loops, best of 3: 260 µs per loop

Upvotes: 3

Related Questions