Reputation: 6397
I have a pandas DataFrame with a datetime64 dtype in one of the columns.
time volume complete closeBid closeAsk openBid openAsk highBid highAsk lowBid lowAsk closeMid
0 2016-08-07 21:00:00+00:00 9 True 0.84734 0.84842 0.84706 0.84814 0.84734 0.84842 0.84706 0.84814 0.84788
1 2016-08-07 21:05:00+00:00 10 True 0.84735 0.84841 0.84752 0.84832 0.84752 0.84846 0.84712 0.8482 0.84788
2 2016-08-07 21:10:00+00:00 10 True 0.84742 0.84817 0.84739 0.84828 0.84757 0.84831 0.84735 0.84817 0.847795
3 2016-08-07 21:15:00+00:00 18 True 0.84732 0.84811 0.84737 0.84813 0.84737 0.84813 0.84721 0.8479 0.847715
4 2016-08-07 21:20:00+00:00 4 True 0.84755 0.84822 0.84739 0.84812 0.84755 0.84822 0.84739 0.84812 0.847885
5 2016-08-07 21:25:00+00:00 4 True 0.84769 0.84843 0.84758 0.84827 0.84769 0.84843 0.84758 0.84827 0.84806
6 2016-08-07 21:30:00+00:00 5 True 0.84764 0.84851 0.84768 0.84852 0.8478 0.84857 0.84764 0.84851 0.848075
7 2016-08-07 21:35:00+00:00 4 True 0.84755 0.84825 0.84762 0.84844 0.84765 0.84844 0.84755 0.84824 0.8479
8 2016-08-07 21:40:00+00:00 1 True 0.84759 0.84812 0.84759 0.84812 0.84759 0.84812 0.84759 0.84812 0.847855
9 2016-08-07 21:45:00+00:00 3 True 0.84727 0.84817 0.84743 0.8482 0.84743 0.84822 0.84727 0.84817 0.84772
My application follows the (simplified) structure below:
class Runner():
    def execute_tick(self, clock_tick, previous_tick):
        candles = self.broker.get_new_candles(clock_tick, previous_tick)
        if candles:
            run_calculations(candles)

class Broker():
    def get_new_candles(self, clock_tick, previous_tick):
        start = previous_tick - timedelta(minutes=1)
        end = clock_tick - timedelta(minutes=3)
        return df[(df.time > start) & (df.time <= end)]
When profiling the app, I noticed that the call to df[(df.time > start) & (df.time <= end)] causes the biggest performance hit, and I was wondering if there is a way to speed up these calls?
EDIT: I'm adding some more info about the use-case here (also, the source is available at: https://github.com/jmelett/pyFxTrader). The data is organized as a dict of Instruments, each containing another dict with candle data for the M5, M30 and H1 timeframes.
Upvotes: 14
Views: 3373
Reputation: 294488
If efficiency is your goal, I'd use numpy for just about everything. I rewrote get_new_candles as get_new_candles2:
from datetime import timedelta

def get_new_candles2(clock_tick, previous_tick):
    start = previous_tick - timedelta(minutes=1)
    end = clock_tick - timedelta(minutes=3)
    # boolean masks computed directly on the underlying numpy datetime64 values
    ge_start = df.time.values >= start.to_datetime64()
    le_end = df.time.values <= end.to_datetime64()
    mask = ge_start & le_end
    # rebuild the DataFrame from the masked numpy arrays
    return pd.DataFrame(df.values[mask], df.index[mask], df.columns)
from io import StringIO  # on Python 2: from StringIO import StringIO
import pandas as pd
text = """time,volume,complete,closeBid,closeAsk,openBid,openAsk,highBid,highAsk,lowBid,lowAsk,closeMid
2016-08-07 21:00:00+00:00,9,True,0.84734,0.84842,0.84706,0.84814,0.84734,0.84842,0.84706,0.84814,0.84788
2016-08-07 21:05:00+00:00,10,True,0.84735,0.84841,0.84752,0.84832,0.84752,0.84846,0.84712,0.8482,0.84788
2016-08-07 21:10:00+00:00,10,True,0.84742,0.84817,0.84739,0.84828,0.84757,0.84831,0.84735,0.84817,0.847795
2016-08-07 21:15:00+00:00,18,True,0.84732,0.84811,0.84737,0.84813,0.84737,0.84813,0.84721,0.8479,0.847715
2016-08-07 21:20:00+00:00,4,True,0.84755,0.84822,0.84739,0.84812,0.84755,0.84822,0.84739,0.84812,0.847885
2016-08-07 21:25:00+00:00,4,True,0.84769,0.84843,0.84758,0.84827,0.84769,0.84843,0.84758,0.84827,0.84806
2016-08-07 21:30:00+00:00,5,True,0.84764,0.84851,0.84768,0.84852,0.8478,0.84857,0.84764,0.84851,0.848075
2016-08-07 21:35:00+00:00,4,True,0.84755,0.84825,0.84762,0.84844,0.84765,0.84844,0.84755,0.84824,0.8479
2016-08-07 21:40:00+00:00,1,True,0.84759,0.84812,0.84759,0.84812,0.84759,0.84812,0.84759,0.84812,0.847855
2016-08-07 21:45:00+00:00,3,True,0.84727,0.84817,0.84743,0.8482,0.84743,0.84822,0.84727,0.84817,0.84772
"""
df = pd.read_csv(StringIO(text), parse_dates=[0])
previous_tick = pd.to_datetime('2016-08-07 21:10:00')
clock_tick = pd.to_datetime('2016-08-07 21:45:00')
get_new_candles2(clock_tick, previous_tick)
Upvotes: 3
Reputation: 11044
I guess you are already running things in a relatively efficient way.
When working with time series, it's usually considered best practice to use the column containing your timestamps as the DataFrame index, so a plain RangeIndex isn't of much use here. However, I ran a couple of tests on a (2650069, 2) DataFrame containing 6 months of trade tick data for a single stock at a single exchange, and it turns out that your approach (creating a boolean array and using it to slice the DataFrame) seems to be 10x faster than regular DatetimeIndex slicing (which I thought was faster).
The data I tested looks like:
Price Volume
time
2016-02-10 11:16:15.951403000 6197.0 200.0
2016-02-10 11:16:16.241380000 6197.0 100.0
2016-02-10 11:16:16.521871000 6197.0 900.0
2016-02-10 11:16:16.541253000 6197.0 100.0
2016-02-10 11:16:16.592049000 6196.0 200.0
Setting start/end:
start = df.index[len(df) // 4]
end = df.index[len(df) // 4 * 3]
Test 1:
%%time
_ = df[start:end] # Same for df.ix[start:end]
CPU times: user 413 ms, sys: 20 ms, total: 433 ms
Wall time: 430 ms
On the other hand, using your approach:
df = df.reset_index()
df.columns = ['time', 'Price', 'Volume']
Test 2:
%%time
u = (df['time'] > start) & (df['time'] <= end)
CPU times: user 21.2 ms, sys: 368 µs, total: 21.6 ms
Wall time: 20.4 ms
Test 3:
%%time
_ = df[u]
CPU times: user 10.4 ms, sys: 27.6 ms, total: 38.1 ms
Wall time: 36.8 ms
Test 4:
%%time
_ = df[(df['time'] > start) & (df['time'] <= end)]
CPU times: user 21.6 ms, sys: 24.3 ms, total: 45.9 ms
Wall time: 44.5 ms
Note: each code block corresponds to a Jupyter notebook cell and its output. I am using the %%time magic because %%timeit usually runs into caching effects which make the code seem faster than it actually is. Also, the kernel was restarted after each run.
I am not totally sure why that is the case (I thought slicing using a DatetimeIndex would make things faster), but I guess it probably has something to do with how things work under the hood in numpy (most likely the datetime slicing operation generates a boolean array which is then used internally by numpy to do the actual slicing - but don't quote me on that).
Upvotes: 0
Reputation: 151
I've learned that those datetime objects can become very memory-hungry and require more computational effort, especially if they are set as the index (DatetimeIndex objects?).
I think your best bet is to cast df.time, start and end to UNIX timestamps (as plain integers, no longer datetime dtypes) and do a simple integer comparison.
The UNIX timestamp will look like: 1471554233 (the time of this posting). More on that here: https://en.wikipedia.org/wiki/Unix_time
Some considerations when doing this (eg keep in mind timezones): Convert datetime to Unix timestamp and convert it back in python
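A minimal sketch of what that might look like, assuming the df, start and end names from the question (and keeping the timezone caveat above in mind): viewing the datetime64 values as int64 gives nanoseconds since the UNIX epoch, so the filter becomes plain integer comparisons.
import pandas as pd

# df['time'], start and end are assumed to follow the question's setup
time_ns = df['time'].values.astype('int64')   # nanoseconds since the UNIX epoch
start_ns = pd.Timestamp(start).value          # Timestamp.value is epoch nanoseconds too
end_ns = pd.Timestamp(end).value
candles = df[(time_ns > start_ns) & (time_ns <= end_ns)]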
Upvotes: 0
Reputation: 19395
It seems to me that using the ix locator is faster:
df.sort_values(by='time',inplace=True)
df.ix[(df.time > start) & (df.time <= end),:]
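For what it's worth, .ix has since been deprecated and removed from pandas, so on recent versions the same selection would have to be written with .loc; a sketch assuming the df/start/end names from the question:
df.sort_values(by='time', inplace=True)
candles = df.loc[(df.time > start) & (df.time <= end), :]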
Upvotes: 0
Reputation: 5943
Pandas query can use numexpr as its engine to speed up evaluation:
df.query('time > @start & time <= @end')
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
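Applied to the question's Broker method, that could look roughly like the sketch below (with numexpr installed, pandas picks it as the default engine for query; the @-prefixed names refer to local variables):
from datetime import timedelta

def get_new_candles(clock_tick, previous_tick):
    # same window as in the question, evaluated via DataFrame.query
    start = previous_tick - timedelta(minutes=1)
    end = clock_tick - timedelta(minutes=3)
    return df.query('time > @start & time <= @end')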
Upvotes: 0
Reputation: 5222
It's a bit of speculation because I can't test it, but two ideas come to my mind.
Using a lookup series to determine the start and end indices of a data frame to be returned:
import numpy as np

s = pd.Series(np.arange(len(df)), index=df.time)
start = s.asof(start)  # position of the last row with time <= start
end = s.asof(end)      # position of the last row with time <= end
ret = df.iloc[start + 1 : end + 1]
Setting the df.time column as the index and taking a slice. (It may or may not be a good solution for other reasons, since this column contains duplicates.)
df = df.set_index('time')
ret = df.loc[start : end]
You may need to add some small Timedelta to start, since .loc slicing includes both endpoints while your filter excludes start.
In both cases you can do the main step (constructing the series or setting the index) only once per data frame.
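A hypothetical sketch of that, built on the question's Broker class and the lookup-series idea (names are illustrative, and it assumes start falls at or after the first timestamp):
import numpy as np
import pandas as pd
from datetime import timedelta

class Broker():
    def __init__(self, df):
        self.df = df
        # one-time setup: map each timestamp to its positional index
        self.positions = pd.Series(np.arange(len(df)), index=df.time)

    def get_new_candles(self, clock_tick, previous_tick):
        start = previous_tick - timedelta(minutes=1)
        end = clock_tick - timedelta(minutes=3)
        lo = self.positions.asof(start)   # last row with time <= start
        hi = self.positions.asof(end)     # last row with time <= end
        return self.df.iloc[lo + 1 : hi + 1]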
Upvotes: 0