Batman

Reputation: 8917

Drop Duplicates in a DataFrame if Timestamps are Close, but not Identical

Imagine that I've got the following DataFrame

            A        | B | C | D
 -------------------------------
 2000-01-01 00:00:00 | 1 | 1 | 1
 2000-01-01 00:04:30 | 1 | 2 | 2
 2000-01-01 00:04:30 | 2 | 3 | 3
 2000-01-02 00:00:00 | 1 | 4 | 4
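For reference, the frame above can be built as follows (a sketch; I'm assuming A is meant to be a real datetime64 column rather than text):

```python
import pandas as pd

# Sample frame from the table above; A is parsed to datetime64 so
# that timestamp arithmetic works (an assumption -- real data may
# need an explicit pd.to_datetime first).
df = pd.DataFrame({
    'A': pd.to_datetime(['2000-01-01 00:00:00', '2000-01-01 00:04:30',
                         '2000-01-01 00:04:30', '2000-01-02 00:00:00']),
    'B': [1, 1, 2, 1],
    'C': [1, 2, 3, 4],
    'D': [1, 2, 3, 4],
})
```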

And I want to drop rows where the values in B are equal and the values in A are "close", say, within five minutes of each other. In this case that means dropping the first two rows but keeping the last two.

So, instead of doing df.drop_duplicates(subset=['A', 'B'], inplace=True, keep=False), I'd like something more like df.drop_duplicates(subset=['A', 'B'], inplace=True, keep=False, func={'A': some_func}), with

def some_func(ts1, ts2):
    # True if the two timestamps are at least five minutes apart
    delta = ts1 - ts2
    return abs(delta.total_seconds()) >= 5 * 60

Is there a way to do this in Pandas?

Upvotes: 1

Views: 2298

Answers (3)

Haleemur Ali

Reputation: 28243

Write a function that accepts a DataFrame, calculates the delta between successive timestamps, and returns the filtered DataFrame. Then groupby and apply.

import pandas as pd
import datetime

# this variant preserves one row from each run of close-by rows
def filter_window(df):
    # minutes elapsed since the previous row in the group
    df['filt'] = (df.A - df.A.shift(1)) / datetime.timedelta(minutes=1)
    # the first row of a group has no predecessor; fill with a value > 5
    df['filt'] = df.filt.fillna(10.0)
    df = df[(df.filt > 5.0) | pd.isnull(df.filt)]
    return df[['A', 'C', 'D']]

df2 = df.groupby('B').apply(filter_window).reset_index()

# With your sample dataset, this is the output of df2

                      A     B   C   D
0   2000-01-01 00:00:00     1   1   1
1   2000-01-02 00:00:00     1   4   4
2   2000-01-01 00:04:30     2   3   3

# this variant drops all close-by rows
def filter_window2(df):
    # minutes since the previous row and until the next row in the group
    df['filt'] = (df.A - df.A.shift(1)) / datetime.timedelta(minutes=1)
    df['filt2'] = (df.A.shift(-1) - df.A) / datetime.timedelta(minutes=1)
    # edge rows have only one neighbour; use whichever gap exists
    df['filt'] = df.filt.fillna(df.filt2)
    df = df[(df.filt > 5.0) | pd.isnull(df.filt)]
    return df[['A', 'C', 'D']]

df3 = df.groupby('B').apply(filter_window2).reset_index()

# With your sample dataset, this is the output of df3

                      A     B   C   D
0   2000-01-02 00:00:00     1   4   4
1   2000-01-01 00:04:30     2   3   3
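The same per-group gap logic can also be written without apply, using the groupby diff transform, which keeps the original index (a sketch against the question's sample frame; the total_seconds conversion is my substitution):

```python
import pandas as pd

# The question's sample frame (assumed dtypes).
df = pd.DataFrame({
    'A': pd.to_datetime(['2000-01-01 00:00:00', '2000-01-01 00:04:30',
                         '2000-01-01 00:04:30', '2000-01-02 00:00:00']),
    'B': [1, 1, 2, 1],
    'C': [1, 2, 3, 4],
    'D': [1, 2, 3, 4],
})

# Minutes since the previous row within the same B group; the first
# row of each group has no predecessor and comes back as NaN.
gap = df.groupby('B').A.diff().dt.total_seconds() / 60

# "Preserve one row per cluster": keep group-leading rows and rows
# more than 5 minutes after their predecessor.
df2 = df[gap.isna() | (gap > 5)]
```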

Upvotes: 0

cs95

Reputation: 402363

m = df.groupby('B').A.apply(lambda x: x.diff().dt.seconds < 300)
m2 = df.B.duplicated(keep=False) & (m | m.shift(-1))
df[~m2]
                    A  B  C  D
2 2000-01-01 00:04:30  2  3  3
3 2000-01-02 00:00:00  1  4  4

Details

m gets a mask of all rows within 5 minutes of each other.

m

0    False
1     True
2    False
3    False
Name: A, dtype: bool

m2 is the final mask of all items that must be dropped.

m2

0     True
1     True
2    False
3    False
dtype: bool
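On recent pandas versions, SeriesGroupBy.apply may prepend the group keys to the result's index, which breaks the alignment above. A version-robust sketch of the same idea uses the groupby diff transform instead (with total_seconds().abs() swapped in for dt.seconds, which only reads the seconds component of each timedelta):

```python
import pandas as pd

# Rebuild the question's sample frame (assumed dtypes).
df = pd.DataFrame({
    'A': pd.to_datetime(['2000-01-01 00:00:00', '2000-01-01 00:04:30',
                         '2000-01-01 00:04:30', '2000-01-02 00:00:00']),
    'B': [1, 1, 2, 1],
    'C': [1, 2, 3, 4],
    'D': [1, 2, 3, 4],
})

# Rows within 5 minutes of the previous row in the same B group.
m = df.groupby('B').A.diff().dt.total_seconds().abs() < 300
# Drop a row if its B value is duplicated and it is close to either
# its predecessor or its successor (the shift is positional, as in
# the original mask).
m2 = df.B.duplicated(keep=False) & (m | m.shift(-1))
result = df[~m2]
```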

Upvotes: 4

BENY

Reputation: 323226

I break the steps down, and you can test with your real data to see whether it works:

# minutes until the next row's timestamp
df['dropme'] = df.A.diff().shift(-1).dt.seconds / 60
# collapse "close" timestamps to a common sentinel value
df['dropme2'] = df.A
df.loc[df.dropme <= 5, 'dropme2'] = 1
# now close timestamps look identical, so drop_duplicates removes them
df.drop_duplicates(['dropme2'], keep=False).drop(['dropme', 'dropme2'], axis=1)
Out[553]: 
                    A  B  C  D
2 2000-01-01 00:04:30  2  3  3
3 2000-01-02 00:00:00  1  4  4
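Spelled out end to end (a sketch; note that diff here runs over the whole A column rather than per B group, which happens to suffice for this ordering, and I cast the sentinel column to object so mixing in the integer sentinel is dtype-safe):

```python
import pandas as pd

# The question's sample frame (assumed dtypes).
df = pd.DataFrame({
    'A': pd.to_datetime(['2000-01-01 00:00:00', '2000-01-01 00:04:30',
                         '2000-01-01 00:04:30', '2000-01-02 00:00:00']),
    'B': [1, 1, 2, 1],
    'C': [1, 2, 3, 4],
    'D': [1, 2, 3, 4],
})

# Minutes until the next row's timestamp (whole column, not per group).
df['dropme'] = df.A.diff().shift(-1).dt.total_seconds() / 60
# Replace "close" timestamps with a common sentinel so that
# drop_duplicates treats them as identical.
df['dropme2'] = df.A.astype(object)
df.loc[df.dropme <= 5, 'dropme2'] = 1
result = (df.drop_duplicates(['dropme2'], keep=False)
            .drop(['dropme', 'dropme2'], axis=1))
```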

Upvotes: 1
