Reputation: 103
The full dataframe is 2.4GB. Currently I am creating a boolean mask based on index and selecting a subset of dataframe. The index is a DateTimeIndex. It takes approx 10 min to get the subset dataframe . Is there a faster way to do this? Please see the code below for your reference.
train_data = pd.read_pickle(self.pkl_path + '/' + "train_data.pkl")
new_train_data = train_data[(train_data.index.date <= self.end_train.date())
& (self.start_train.date() <= train_data.index.date)]
Upvotes: 2
Views: 1817
Reputation: 5247
Let's create some data first:
import random, string
import pandas as pd
from datetime import datetime, timedelta
today=datetime.now()
idx=pd.date_range(today, today + timedelta(4999), freq='D')
df = pd.DataFrame([{
'name1': ''.join([random.choice(string.ascii_letters) for i in range(10)]),
'name2': ''.join([random.choice(string.ascii_letters) for i in range(10)]),
'val1': random.randint(0, 2**16),
'val2': random.randint(0, 2**16),
'val3': random.randint(0, 2**16),
} for j in range(5000)],
index=idx
)
and now check for execution times using df.index
and a comparison to a date string:
%timeit df[(df.index>='2019-06-01')&('2019-08-01'>=df.index)]
742 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
versus a comparison between explicit datetime.date
objects (i.e. using df.index.date
, the compared dates are the same as above):
%timeit df[(df.index.date>=(today + timedelta(days=25)).date())&((today + timedelta(days=25)).date()>=df.index.date)]
4.2 ms ± 82.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It appears you are using the latter, while the former is more performant.
Update:
If you want to be a bit faster, you could be more explicit, i.e. relay to a comparison of numpy.datetime64
arrays by using:
start_date=pd.to_datetime(['2019-06-01 00:00:00']).values[0]
end_date=pd.to_datetime(['2019-08-01 00:00:00']).values[0]
%timeit df[(df.index.values>=start_date)&(df.index.values<=end_date)]
594 µs ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 1