user3605780

Reputation: 7072

Finding duplicates by timestamp difference in pandas is not working

I found this example. I need to get all rows where the next row has the same price and follows within one second. It's the same problem as in the other question, but the solution isn't working for me.

I'm using Python 3.6.3 and pandas 0.22.0.

My df:

                    timestamp   price
    0  2018-04-29 13:14:43.026  6394.0
    1  2018-04-29 13:16:53.714  6378.0
    2  2018-04-29 13:17:01.324  6378.0
    3  2018-04-29 13:17:02.246  6384.0
    4  2018-04-29 13:17:07.413  6384.0
    5  2018-04-29 13:17:08.331  6380.0
    6  2018-04-29 13:17:11.437  6380.0
    7  2018-04-29 13:17:11.895  6377.0
    8  2018-04-29 13:17:13.449  6377.0
    9  2018-04-29 13:17:13.452  6379.0

    timestamp    datetime64[ns]
    price               float64
    dtype: object

I tried the following code:

    data = df[(df.groupby(["price"], as_index=False)["timestamp"].diff().fillna(0).dt.seconds <= 1).reset_index(drop=True)]

and this is the result:

                    timestamp   price
    0 2018-04-29 13:14:43.026  6394.0
    1 2018-04-29 13:16:53.714  6378.0
    2 2018-04-29 13:17:01.324  6378.0
    4 2018-04-29 13:17:07.413  6384.0
    5 2018-04-29 13:17:08.331  6380.0
    7 2018-04-29 13:17:11.895  6377.0
    9 2018-04-29 13:17:13.452  6379.0

EDIT:

The output for the dataframe above should be empty, but if we change the dataframe to, e.g.:

                    timestamp   price
    0  2018-04-29 13:14:43.026  6394.0
    1  2018-04-29 13:16:53.714  6378.0
    2  2018-04-29 13:17:01.324  6378.0
    3  2018-04-29 13:17:02.246  6378.0
    4  2018-04-29 13:17:07.413  6384.0
    5  2018-04-29 13:17:08.331  6380.0
    6  2018-04-29 13:17:11.437  6380.0
    7  2018-04-29 13:17:11.895  6377.0
    8  2018-04-29 13:17:13.449  6377.0
    9  2018-04-29 13:17:13.452  6377.0

it should output:

                    timestamp   price
    2  2018-04-29 13:17:01.324  6378.0
    3  2018-04-29 13:17:02.246  6378.0
    8  2018-04-29 13:17:13.449  6377.0
    9  2018-04-29 13:17:13.452  6377.0

Upvotes: 0

Answers (1)

Aaron Brock

Reputation: 4536

Unless I'm mistaken, you're overcomplicating this a bit; you should just need to take the diff:

    from io import StringIO

    import pandas as pd

    df = pd.read_fwf(StringIO(
    '''timestamp                price
    2018-04-29 13:14:43.026  6394.0
    2018-04-29 13:16:53.714  6378.0
    2018-04-29 13:17:01.324  6378.0
    2018-04-29 13:17:02.246  6378.0
    2018-04-29 13:17:07.413  6384.0
    2018-04-29 13:17:08.331  6380.0
    2018-04-29 13:17:11.437  6380.0
    2018-04-29 13:17:11.895  6377.0
    2018-04-29 13:17:13.449  6377.0
    2018-04-29 13:17:13.452  6379.0'''
    ), colspecs=[(0,23), (25, 31)], dtype={
        'timestamp': 'datetime64[ns]',
        'price': 'float'
    })

    # Row-to-row differences for both columns
    diff = df.diff()
    # True where a row has the same price as the previous row and the
    # whole-seconds part of the time gap is at most 1
    selection = (diff['timestamp'].dt.seconds <= 1) & (diff['price'] == 0)
    # Also keep the first row of each matching pair
    selection = selection | selection.shift(periods=-1)
    print(df[selection])

Output:

                    timestamp   price
    2 2018-04-29 13:17:01.324  6378.0
    3 2018-04-29 13:17:02.246  6378.0
    7 2018-04-29 13:17:11.895  6377.0
    8 2018-04-29 13:17:13.449  6377.0
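
One caveat, sketched here under a few assumptions: `.dt.seconds` returns only the whole-seconds component of each gap, so a gap of 1.554 s counts as 1 and passes `<= 1`, which is why rows 7 and 8 appear above. If "within one second" is meant strictly, comparing the gap against a `pd.Timedelta` may be closer to the expected output in the question. The helper name `close_duplicates` and its `max_gap` parameter are made up for illustration, the data is the second dataframe from the question, and `shift(..., fill_value=False)` assumes pandas 0.24 or newer (so not the 0.22.0 mentioned in the question).

```python
from io import StringIO

import pandas as pd

# Second dataframe from the question, typed out as CSV for convenience.
raw = """timestamp,price
2018-04-29 13:14:43.026,6394.0
2018-04-29 13:16:53.714,6378.0
2018-04-29 13:17:01.324,6378.0
2018-04-29 13:17:02.246,6378.0
2018-04-29 13:17:07.413,6384.0
2018-04-29 13:17:08.331,6380.0
2018-04-29 13:17:11.437,6380.0
2018-04-29 13:17:11.895,6377.0
2018-04-29 13:17:13.449,6377.0
2018-04-29 13:17:13.452,6377.0
"""
df = pd.read_csv(StringIO(raw), parse_dates=["timestamp"])


def close_duplicates(frame, max_gap=pd.Timedelta(seconds=1)):
    """Hypothetical helper: keep rows whose neighbour has the same price
    and is at most max_gap away in time."""
    gap = frame["timestamp"].diff()          # NaT for the first row
    same_price = frame["price"].diff() == 0  # False for the first row
    # Comparing full Timedeltas keeps the fractional seconds,
    # so a 1.554 s gap is not treated as "within 1 s".
    follows = (gap <= max_gap) & same_price
    # Mark the earlier row of each matching pair as well.
    precedes = follows.shift(-1, fill_value=False)
    return frame[follows | precedes]


result = close_duplicates(df)
print(result)
```

On this data the helper keeps the rows with index 2, 3, 8 and 9; row 7 is dropped because it is 1.554 s away from row 8.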

Upvotes: 1
