IamTheWalrus
IamTheWalrus

Reputation: 604

Filtering pandas dataframe by difference of adjacent rows

I have a dataframe indexed by datetime. I want to filter out rows based on the difference between their index and the index of the previous row.

So, if my criteria is "remove all rows that are over one hour late than the previous row", the second row in the example below should be removed:

2005-07-15 17:00:00  
2005-07-17 18:00:00  

While in the following case, both rows stay:

2005-07-17 23:00:00  
2005-07-18 00:00:00 

Upvotes: 3

Views: 1674

Answers (1)

jezrael
jezrael

Reputation: 863611

It seems you need boolean indexing with diff for difference and compare with 1 hour Timedelta:

dates=['2005-07-15 17:00:00','2005-07-17 18:00:00', '2005-07-17 19:00:00',  
      '2005-07-17 23:00:00', '2005-07-18 00:00:00']
df = pd.DataFrame({'a':range(5)}, index=pd.to_datetime(dates))

print (df)
                     a
2005-07-15 17:00:00  0
2005-07-17 18:00:00  1
2005-07-17 19:00:00  2
2005-07-17 23:00:00  3
2005-07-18 00:00:00  4

diff = df.index.to_series().diff().fillna(0)
print (diff)
2005-07-15 17:00:00   0 days 00:00:00
2005-07-17 18:00:00   2 days 01:00:00
2005-07-17 19:00:00   0 days 01:00:00
2005-07-17 23:00:00   0 days 04:00:00
2005-07-18 00:00:00   0 days 01:00:00
dtype: timedelta64[ns]

mask = diff <= pd.Timedelta(1, unit='h')
print (mask)
2005-07-15 17:00:00     True
2005-07-17 18:00:00    False
2005-07-17 19:00:00     True
2005-07-17 23:00:00    False
2005-07-18 00:00:00     True
dtype: bool

df = df[mask]
print (df)
                     a
2005-07-15 17:00:00  0
2005-07-17 19:00:00  2
2005-07-18 00:00:00  4

Upvotes: 4

Related Questions