Faster way to subset a dataframe using pandas

Question

The full dataframe is 2.4GB. Currently I am creating a boolean mask based on index and selecting a subset of dataframe. The index is a DateTimeIndex. It takes approx 10 min to get the subset dataframe . Is there a faster way to do this? Please see the code below for your reference.

        train_data = pd.read_pickle(self.pkl_path + '/' + "train_data.pkl")
        new_train_data = train_data[(train_data.index.date <= self.end_train.date())
                                    & (self.start_train.date() <= train_data.index.date)]

Asmus · Accepted Answer

Let's create some data first:

import random, string
import pandas as pd
from datetime import datetime, timedelta

today=datetime.now()
idx=pd.date_range(today, today + timedelta(4999), freq='D')
df = pd.DataFrame([{
    'name1': ''.join([random.choice(string.ascii_letters) for i in range(10)]),
    'name2': ''.join([random.choice(string.ascii_letters) for i in range(10)]),
    'val1': random.randint(0, 2**16),
    'val2': random.randint(0, 2**16),
    'val3': random.randint(0, 2**16),
    } for j in range(5000)],
    index=idx
)

and now check for execution times using df.index and a comparison to a date string:

%timeit df[(df.index>='2019-06-01')&('2019-08-01'>=df.index)]

742 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

versus a comparison between explicit datetime.date objects (i.e. using df.index.date, the compared dates are the same as above):

%timeit df[(df.index.date>=(today + timedelta(days=25)).date())&((today + timedelta(days=25)).date()>=df.index.date)]

4.2 ms ± 82.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It appears you are using the latter, while the former is more performant.

Update:

If you want to be a bit faster, you could be more explicit, i.e. relay to a comparison of numpy.datetime64 arrays by using:

start_date=pd.to_datetime(['2019-06-01 00:00:00']).values[0]
end_date=pd.to_datetime(['2019-08-01 00:00:00']).values[0]

%timeit df[(df.index.values>=start_date)&(df.index.values<=end_date)]

594 µs ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Faster way to subset a dataframe using pandas

Answers (1)

Related Questions