Performance issues with pandas and filtering on datetime column

I have a pandas DataFrame with a datetime64 dtype in one of its columns.

    time    volume  complete    closeBid    closeAsk    openBid openAsk highBid highAsk lowBid  lowAsk  closeMid
0   2016-08-07 21:00:00+00:00   9   True    0.84734 0.84842 0.84706 0.84814 0.84734 0.84842 0.84706 0.84814 0.84788
1   2016-08-07 21:05:00+00:00   10  True    0.84735 0.84841 0.84752 0.84832 0.84752 0.84846 0.84712 0.8482  0.84788
2   2016-08-07 21:10:00+00:00   10  True    0.84742 0.84817 0.84739 0.84828 0.84757 0.84831 0.84735 0.84817 0.847795
3   2016-08-07 21:15:00+00:00   18  True    0.84732 0.84811 0.84737 0.84813 0.84737 0.84813 0.84721 0.8479  0.847715
4   2016-08-07 21:20:00+00:00   4   True    0.84755 0.84822 0.84739 0.84812 0.84755 0.84822 0.84739 0.84812 0.847885
5   2016-08-07 21:25:00+00:00   4   True    0.84769 0.84843 0.84758 0.84827 0.84769 0.84843 0.84758 0.84827 0.84806
6   2016-08-07 21:30:00+00:00   5   True    0.84764 0.84851 0.84768 0.84852 0.8478  0.84857 0.84764 0.84851 0.848075
7   2016-08-07 21:35:00+00:00   4   True    0.84755 0.84825 0.84762 0.84844 0.84765 0.84844 0.84755 0.84824 0.8479
8   2016-08-07 21:40:00+00:00   1   True    0.84759 0.84812 0.84759 0.84812 0.84759 0.84812 0.84759 0.84812 0.847855
9   2016-08-07 21:45:00+00:00   3   True    0.84727 0.84817 0.84743 0.8482  0.84743 0.84822 0.84727 0.84817 0.84772

My application follows the (simplified) structure below:

from datetime import timedelta

class Runner():
    def execute_tick(self, clock_tick, previous_tick):
        candles = self.broker.get_new_candles(clock_tick, previous_tick)
        if not candles.empty:
            run_calculations(candles)

class Broker():
    def get_new_candles(self, clock_tick, previous_tick):
        start = previous_tick - timedelta(minutes=1)
        end = clock_tick - timedelta(minutes=3)
        return df[(df.time > start) & (df.time <= end)]

When profiling the app, I noticed that the call df[(df.time > start) & (df.time <= end)] causes the biggest performance hit. Is there a way to speed up these calls?

EDIT: I'm adding some more info about the use-case here (also, source is available at: https://github.com/jmelett/pyFxTrader)

Upvotes: 14

Views: 3373

Answers (6)

piRSquared

Reputation: 294488

If efficiency is your goal, I'd use numpy for just about everything.

I rewrote get_new_candles as get_new_candles2:

def get_new_candles2(clock_tick, previous_tick):
    start = previous_tick - timedelta(minutes=1)
    end = clock_tick - timedelta(minutes=3)
    # compare on the underlying numpy arrays to skip pandas overhead
    mask = (df.time.values >= start.to_datetime64()) & (df.time.values <= end.to_datetime64())
    return pd.DataFrame(df.values[mask], df.index[mask], df.columns)

Setup of data

from io import StringIO  # on Python 2: from StringIO import StringIO
from datetime import timedelta
import pandas as pd

text = """time,volume,complete,closeBid,closeAsk,openBid,openAsk,highBid,highAsk,lowBid,lowAsk,closeMid
2016-08-07 21:00:00+00:00,9,True,0.84734,0.84842,0.84706,0.84814,0.84734,0.84842,0.84706,0.84814,0.84788
2016-08-07 21:05:00+00:00,10,True,0.84735,0.84841,0.84752,0.84832,0.84752,0.84846,0.84712,0.8482,0.84788
2016-08-07 21:10:00+00:00,10,True,0.84742,0.84817,0.84739,0.84828,0.84757,0.84831,0.84735,0.84817,0.847795
2016-08-07 21:15:00+00:00,18,True,0.84732,0.84811,0.84737,0.84813,0.84737,0.84813,0.84721,0.8479,0.847715
2016-08-07 21:20:00+00:00,4,True,0.84755,0.84822,0.84739,0.84812,0.84755,0.84822,0.84739,0.84812,0.847885
2016-08-07 21:25:00+00:00,4,True,0.84769,0.84843,0.84758,0.84827,0.84769,0.84843,0.84758,0.84827,0.84806
2016-08-07 21:30:00+00:00,5,True,0.84764,0.84851,0.84768,0.84852,0.8478,0.84857,0.84764,0.84851,0.848075
2016-08-07 21:35:00+00:00,4,True,0.84755,0.84825,0.84762,0.84844,0.84765,0.84844,0.84755,0.84824,0.8479
2016-08-07 21:40:00+00:00,1,True,0.84759,0.84812,0.84759,0.84812,0.84759,0.84812,0.84759,0.84812,0.847855
2016-08-07 21:45:00+00:00,3,True,0.84727,0.84817,0.84743,0.8482,0.84743,0.84822,0.84727,0.84817,0.84772
"""

df = pd.read_csv(StringIO(text), parse_dates=[0])

Test input variables

previous_tick = pd.to_datetime('2016-08-07 21:10:00')
clock_tick = pd.to_datetime('2016-08-07 21:45:00')

get_new_candles2(clock_tick, previous_tick)

[screenshot: the filtered DataFrame returned by get_new_candles2]


Timing

[screenshot: timing comparison of the two approaches]
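
Since the timing screenshot did not survive, a comparison along these lines should reproduce it (numbers will vary by machine; note that on current pandas the tz-aware df.time column must be compared against tz-aware bounds):

from timeit import timeit

# tz-aware bounds, since parse_dates turned df.time into UTC-aware timestamps
start = (previous_tick - timedelta(minutes=1)).tz_localize('UTC')
end = (clock_tick - timedelta(minutes=3)).tz_localize('UTC')

print(timeit(lambda: df[(df.time > start) & (df.time <= end)], number=1000))    # original mask
print(timeit(lambda: get_new_candles2(clock_tick, previous_tick), number=1000)) # numpy rewrite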

Upvotes: 3

Gustavo Bezerra

Reputation: 11044

I guess you are already running things in a relatively efficient way.

When working with time series, it is usually best practice to use the column containing your timestamps as the DataFrame index; a plain RangeIndex isn't of much use. However, I ran a couple of tests on a (2650069, 2) DataFrame containing 6 months of trade tick data for a given stock at a given exchange, and it turns out that your approach (creating a boolean array and using it to slice the DataFrame) seems to be about 10x faster than regular DatetimeIndex slicing (which I had thought was faster).

The data I tested looks like:

                                Price  Volume
time                                         
2016-02-10 11:16:15.951403000  6197.0   200.0
2016-02-10 11:16:16.241380000  6197.0   100.0
2016-02-10 11:16:16.521871000  6197.0   900.0
2016-02-10 11:16:16.541253000  6197.0   100.0
2016-02-10 11:16:16.592049000  6196.0   200.0

Setting start/end:

start = df.index[len(df) // 4]   # integer division, so this also works on Python 3
end = df.index[len(df) // 4 * 3]

Test 1:

%%time
_ = df[start:end]  # same timing with df.loc[start:end]

CPU times: user 413 ms, sys: 20 ms, total: 433 ms
Wall time: 430 ms

On the other hand, using your approach:

df = df.reset_index()
df.columns = ['time', 'Price', 'Volume']

Test 2:

%%time
u = (df['time'] > start) & (df['time'] <= end)

CPU times: user 21.2 ms, sys: 368 µs, total: 21.6 ms
Wall time: 20.4 ms

Test 3:

%%time
_ = df[u]

CPU times: user 10.4 ms, sys: 27.6 ms, total: 38.1 ms
Wall time: 36.8 ms

Test 4:

%%time
_ = df[(df['time'] > start) & (df['time'] <= end)]

CPU times: user 21.6 ms, sys: 24.3 ms, total: 45.9 ms
Wall time: 44.5 ms

Note: Each code block above corresponds to a Jupyter notebook cell and its output. I am using the %%time magic because %%timeit usually introduces caching effects which make the code seem faster than it actually is. Also, the kernel was restarted after each run.

I am not totally sure why that is the case (I thought slicing with a DatetimeIndex would be faster), but it probably has something to do with how things work under the hood in numpy: most likely the datetime slicing operation generates a boolean array which numpy then uses internally to do the actual slicing (but don't quote me on that).
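
One way to probe that hunch: when the index is sorted, the slice bounds can be found with a binary search instead of a full boolean scan. A sketch, assuming the original DatetimeIndex-indexed frame from above:

# binary search on a sorted DatetimeIndex finds both bounds in O(log n)
lo = df.index.searchsorted(start, side='right')  # first position with time > start
hi = df.index.searchsorted(end, side='right')    # first position with time > end
result = df.iloc[lo:hi]                          # rows with start < time <= end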

Upvotes: 0

Mansweet

Reputation: 151

I've learned that datetime objects can become very memory-hungry and require more computational effort, especially if they are set as the index (DatetimeIndex objects).

I think your best bet is to cast df.time, start, and end to UNIX timestamps (plain ints, no longer datetime dtypes) and do a simple integer comparison (sketched below).

The UNIX timestamp will look like: 1471554233 (the time of this posting). More on that here: https://en.wikipedia.org/wiki/Unix_time

Some considerations when doing this (e.g. keep timezones in mind): Convert datetime to Unix timestamp and convert it back in python
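
A minimal sketch of the idea, assuming df.time holds UTC timestamps as in the question; note that pandas stores nanoseconds since the epoch rather than the seconds shown on the Wikipedia page:

import pandas as pd

# int64 view of the timestamps: nanoseconds since the UNIX epoch (UTC)
time_ns = df['time'].values.astype('int64')
start_ns = pd.Timestamp(start).value  # .value is likewise epoch nanoseconds
end_ns = pd.Timestamp(end).value

result = df[(time_ns > start_ns) & (time_ns <= end_ns)]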

Upvotes: 0

ℕʘʘḆḽḘ

Reputation: 19395

It seems to me that using the ix indexer on a frame sorted by time is faster:

df.sort_values(by='time', inplace=True)
df.ix[(df.time > start) & (df.time <= end), :]
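
For reference, .ix was deprecated in pandas 0.20 and removed in 1.0; the same lookup with .loc:

df.sort_values(by='time', inplace=True)
df.loc[(df.time > start) & (df.time <= end), :]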

Upvotes: 0

den.run.ai

Reputation: 5943

Pandas' query can use numexpr as its engine to speed up the evaluation:

df.query('time > @start & time <= @end')

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
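
query() picks the numexpr engine automatically when the numexpr package is installed and otherwise falls back to plain-Python evaluation; passing the engine explicitly makes the dependency visible (it raises if numexpr is missing):

# equivalent to df[(df.time > start) & (df.time <= end)], but evaluated by
# numexpr; requires `pip install numexpr`
result = df.query('time > @start & time <= @end', engine='numexpr')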

Upvotes: 0

ptrj

Reputation: 5222

It's a bit of speculation because I can't test it, but two ideas come to mind.

  1. Using a lookup series to determine the start and end positions of the slice to be returned:

    import numpy as np

    s = pd.Series(np.arange(len(df)), index=df.time)
    start_pos = s.asof(start)  # position of the last row with time <= start
    end_pos = s.asof(end)      # position of the last row with time <= end
    ret = df.iloc[start_pos + 1 : end_pos + 1]  # rows with start < time <= end
    
  2. Setting the df.time column as the index and taking a slice. (It may or may not be a good solution for other reasons, since this column contains duplicates.)

    df = df.set_index('time')
    ret = df.loc[start : end]
    

    You may need to add some small Timedelta to start (see the sketch after this list).

In both cases you can do the main step (construct the series or set the index) only once per data frame.
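
For instance, to emulate the strict time > start comparison despite .loc's inclusive slicing, start can be nudged forward by one nanosecond (a sketch, assuming nanosecond-resolution timestamps):

# .loc slicing includes both endpoints; shifting start by 1 ns turns
# `start <= time <= end` into the desired `start < time <= end`
ret = df.loc[start + pd.Timedelta('1ns') : end]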

Upvotes: 0
