Basj

Reputation: 46473

Keeping only data for which timedelta=1minute with pandas

Let's generate 10 rows of a time series with a non-constant time step:

import pandas as pd
import numpy as np
x = pd.DataFrame(np.random.random(10),
                 index=pd.date_range('1/1/2011', periods=5, freq='1min')
                 .union(pd.date_range('1/2/2011', periods=5, freq='1min')))

Example of data:

2011-01-01 00:00:00  0.144852
2011-01-01 00:01:00  0.510248
2011-01-01 00:02:00  0.911903
2011-01-01 00:03:00  0.392504
2011-01-01 00:04:00  0.054307
2011-01-02 00:00:00  0.918862
2011-01-02 00:01:00  0.988054
2011-01-02 00:02:00  0.780668
2011-01-02 00:03:00  0.831947
2011-01-02 00:04:00  0.707357

Now let's define r as the so-called "returns" (difference between consecutive rows):

r = x[1:] - x[:-1].values

How can I clean the data by removing each r[i] for which the time difference was not 1 minute? (Here there is exactly one such row in r to remove.)

Upvotes: 1

Views: 124

Answers (1)

EdChum

Reputation: 394041

IIUC I think you want the following:

In [26]:
x[(x.index.to_series().diff() == pd.Timedelta(1, 'm')) | (x.index.to_series().diff().isnull())]

Out[26]:
                            0
2011-01-01 00:00:00  0.367675
2011-01-01 00:01:00  0.128325
2011-01-01 00:02:00  0.772191
2011-01-01 00:03:00  0.638847
2011-01-01 00:04:00  0.476668
2011-01-02 00:01:00  0.992888
2011-01-02 00:02:00  0.944810
2011-01-02 00:03:00  0.171831
2011-01-02 00:04:00  0.316064

This converts the index to a Series using to_series so that we can call diff, and then compares the result with a Timedelta of 1 minute. It also handles the first row, where diff returns NaT, so that row is kept.
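The same mask idea can be applied to r directly rather than to x. The following is only a minimal sketch, assuming r is defined as in the question; the names dt and r_clean are introduced here for illustration and are not from the original post. Because r drops the first row of x, there is no NaT case left to handle:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
idx = pd.date_range('1/1/2011', periods=5, freq='1min').union(
    pd.date_range('1/2/2011', periods=5, freq='1min'))
x = pd.DataFrame(np.random.random(10), index=idx)

# Returns as defined in the question: difference between consecutive rows
r = x[1:] - x[:-1].values

# Time step that produced each row of r (first row of x dropped, so no NaT)
dt = x.index.to_series().diff().iloc[1:]

# Keep only the returns computed across a step of exactly 1 minute
r_clean = r.loc[dt == pd.Timedelta(minutes=1)]
```

With this data, r_clean no longer contains the row at 2011-01-02 00:00:00, whose difference spans the jump between the two days.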

Upvotes: 2
