Esa J. Obande

Reputation: 25

Filter out unique values of a column within a set time interval

I have a dataframe with dates and tick-data like below

    Date                    Bid
0   20160601 00:00:00.020   160.225
1   20160601 00:00:00.136   160.226
2   20160601 00:00:00.192   160.225
3   20160601 00:00:00.327   160.230
4   20160601 00:00:01.606   160.231
5   20160601 00:00:01.613   160.230

I want to filter out unique values in the 'Bid' column at set intervals

E.g: 2016-06-01 00:00:00 - 00:15:00, 2016-06-01 00:15:00 - 00:30:00...

The result should be a new dataframe that keeps the filtered values together with their datetimes. Here's the code I have so far:

# Convert the Date column to a datetime index, truncated to seconds
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = df['Date'].astype('datetime64[s]')
df.set_index('Date', inplace=True)

# Create a new DataFrame with the filtered values
ts = pd.DataFrame(df.loc['2016-06-01'].between_time('00:00', '00:30')['Bid'].unique())

With the method above I lose the datetime of each filtered value in the process of creating the new DataFrame, and I have to manually input each date and time interval, which is unrealistic.

Output:
    0
0   160.225
1   160.226
2   160.230
3   160.231
4   160.232
5   160.228
6   160.227

Ideally I'm looking for an operation where I can set the time interval as a timedelta and have an operation done on the whole file (about 8Gb) at once, creating a new DataFrame with Date and Bid columns of the unique values within the set interval. Like this

    Date                    Bid
0   20160601 00:00:00.020   160.225
1   20160601 00:00:00.136   160.226
2   20160601 00:00:00.327   160.230
3   20160601 00:00:01.606   160.231
...
805 20160601 00:15:00.606   159.127

PS. I also tried pd.rolling() and pd.resample() with apply(lambda x: ...) (e.g. df['Bid'].unique()), but I was never able to make it work; maybe someone more experienced could attempt it.

Upvotes: 1

Views: 589

Answers (1)

user3483203

Reputation: 51155

Just to clarify: this is not a rolling calculation. You mentioned attempting to solve this using rolling, but from your clarification it seems you want to split the time series into discrete, non-overlapping 15-minute windows.

Setup

df = pd.DataFrame({
    'Date': [
        '2016-06-01 00:00:00.020', '2016-06-01 00:00:00.136',
        '2016-06-01 00:15:00.636', '2016-06-01 00:15:02.836',
    ],
    'Bid': [150, 150, 200, 200]
})

print(df)

                      Date  Bid
0  2016-06-01 00:00:00.020  150
1  2016-06-01 00:00:00.136  150         # Should be dropped
2  2016-06-01 00:15:00.636  200
3  2016-06-01 00:15:02.836  200         # Should be dropped

First, verify that your Date column is datetime:

df.Date = pd.to_datetime(df.Date)

Now use dt.floor to round each value down to the nearest 15 minutes, and use this new column to drop_duplicates per 15-minute window while still keeping the full precision of your dates.

df.assign(flag=df.Date.dt.floor('15T')).drop_duplicates(['flag', 'Bid']).drop(columns='flag')

                     Date  Bid
0 2016-06-01 00:00:00.020  150
2 2016-06-01 00:15:00.636  200
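Since you asked for the interval as a timedelta: dt.floor also accepts a pd.Timedelta (anything pandas can interpret as a frequency), so the window size can be parameterized instead of hard-coded. A minimal, self-contained sketch of the same approach:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime([
        '2016-06-01 00:00:00.020', '2016-06-01 00:00:00.136',
        '2016-06-01 00:15:00.636', '2016-06-01 00:15:02.836',
    ]),
    'Bid': [150, 150, 200, 200]
})

interval = pd.Timedelta(minutes=15)  # change the window size here

# Floor each timestamp to its window, drop duplicate Bids per window,
# then discard the helper column so only Date and Bid remain
out = (df.assign(flag=df['Date'].dt.floor(interval))
         .drop_duplicates(['flag', 'Bid'])
         .drop(columns='flag')
         .reset_index(drop=True))
print(out)
```

The same `out` would come from `freq='15T'`; the Timedelta form just makes the interval a single tunable variable.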

From my original answer, which I still believe holds value: if you'd like to access the unique values per group, you can make use of pd.Grouper and unique. Learning to leverage pd.Grouper is a powerful tool to have with pandas:

df.groupby(pd.Grouper(key='Date', freq='15T')).Bid.unique()

Date
2016-06-01 00:00:00    [150]
2016-06-01 00:15:00    [200]
Freq: 15T, Name: Bid, dtype: object
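Regarding the ~8 GB file: the floor-and-drop_duplicates idea also works chunk by chunk with pd.read_csv(chunksize=...), so the whole file never has to fit in memory. A sketch, assuming a CSV with Date and Bid columns (the filename, function name, and chunk size are illustrative):

```python
import pandas as pd

def unique_bids_per_window(path, freq='15min', chunksize=1_000_000):
    """Stream a large tick file and keep the first occurrence of each
    Bid per time window, without loading the whole file at once."""
    pieces = []
    for chunk in pd.read_csv(path, parse_dates=['Date'], chunksize=chunksize):
        # Floor to the window and deduplicate within this chunk first,
        # so only a small remainder is carried forward
        chunk['flag'] = chunk['Date'].dt.floor(freq)
        pieces.append(chunk.drop_duplicates(['flag', 'Bid']))
    # A Bid can repeat across a chunk boundary, so deduplicate once more
    out = pd.concat(pieces).drop_duplicates(['flag', 'Bid'])
    return out.drop(columns='flag').reset_index(drop=True)
```

Per-chunk deduplication keeps the concatenated remainder small; the final drop_duplicates then handles any window that straddles two chunks.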

Upvotes: 1
