BearsBeetBattlestar

Reputation: 318

Improve performance when filtering on an object column containing dates

I'm working on the NYC car crash dataset. The current CSV has data from 2012 on, and I want to extract just the 2017 values. The problem is that DATE is an object dtype, and trying something like this ended up taking too long:

data = data[pd.to_datetime(data['DATE']).dt.year == 2017]

Here is how a sample of the data looks:

[Screenshot: head of the dataset]

Does anyone possibly know of a faster or more efficient way? Thank you.

Upvotes: 1

Views: 59

Answers (1)

jpp

Reputation: 164773

The bottleneck is almost certainly in datetime conversion.

To improve performance, you can utilise the infer_datetime_format argument of pd.to_datetime. As per the docs:

infer_datetime_format : boolean, default False

If True and no format is given, attempt to infer the format of the datetime strings, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.

For example:

data['DATE'] = pd.to_datetime(data['DATE'], infer_datetime_format=True)
data = data[data['DATE'].dt.year == 2017]

Alternatively, you can specify the precise format:

data['DATE'] = pd.to_datetime(data['DATE'], format='%m/%d/%Y')
data = data[data['DATE'].dt.year == 2017]
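If you want to check the speedup on your own copy of the data, a rough timing comparison along these lines should do it. This is a minimal sketch: nyc_crashes.csv is a placeholder filename, and format='%m/%d/%Y' assumes the dates look like 07/15/2017, which you should confirm against your file.

import time
import pandas as pd

# Placeholder filename; only the DATE column is needed for the comparison.
raw = pd.read_csv('nyc_crashes.csv', usecols=['DATE'])

def timed(label, func):
    # Run the parse once and report wall-clock time.
    start = time.perf_counter()
    result = func()
    print(f'{label}: {time.perf_counter() - start:.2f}s')
    return result

# Baseline: pandas guesses the format row by row.
timed('default parse', lambda: pd.to_datetime(raw['DATE']))

# Infer the format once from the first value, then use the fast path.
timed('infer_datetime_format', lambda: pd.to_datetime(raw['DATE'], infer_datetime_format=True))

# Explicit format string (assumed MM/DD/YYYY here).
timed('explicit format', lambda: pd.to_datetime(raw['DATE'], format='%m/%d/%Y'))

On a large column of uniformly formatted date strings, the inferred and explicit-format versions should both come out well ahead of the default parser, roughly in line with the ~5-10x the docs mention.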

Upvotes: 2
