H T
H T

Reputation: 3

Find Gaps in a Pandas Dataframe

I have a Dataframe which has a column for Minutes and correlated value, the frequency is about 79 seconds but sometimes there is missing data for a period (no rows at all). I want to detect if there is a gap of 25 or more Minutes and delete the dataset if so. How do I test if there is a gap which is?

The dataframe looks like this:

INDEX    minutes     data                                              
0        23.000      1.456
1        24.185      1.223
2        27.250      0.931
3        55.700      2.513
4        56.790      1.446
...      ...         ...

So there is a irregular but short gap and one that exceeds 25 Minutes. In this case I want the dataset to be empty:

I am quite new to Python, especially to Pandas so an explanation would be helpful to learn.

Upvotes: 0

Views: 1819

Answers (1)

jfaccioni
jfaccioni

Reputation: 7509

You can use numpy.roll to create a column with shifted values (i.e. the first value from the original column becomes the second value, the second becomes the third, etc):

import pandas as pd
import numpy as np

df = pd.DataFrame({'minutes': [23.000, 24.185, 27.250, 55.700, 56.790]})

np.roll(df['minutes'], 1)                                                   
# output: array([56.79 , 23.   , 24.185, 27.25 , 55.7  ])

Add this as a new column to your dataframe and subtract the original column with the new column.

We also drop the first row beforehand, since we don't want to calculate the difference from your first timepoint in the original column and your last timepoint that got rolled to the start of the new column.

Then we just ask if any of the values resulting from the subtraction is above your threshold:

df['rolled_minutes'] = np.roll(df['minutes'], 1)
dropped_df = df.drop(index=0)
diff = dropped_df['minutes'] - dropped_df['rolled_minutes']
(diff > 25).any()
# output: True

Upvotes: 1

Related Questions