Reputation: 318
Working on the NYC car crash dataset. The current csv has data from 2012 and I wanted to extract out the 2017 values. Thing is DATE is an object dtype and trying something like this ended up taking too long
data = data[(pd.to_datetime((data['DATE'])).dt.year == 2017)]
Here is how a sample of the data looks
Does anyone possibly know of a faster or more efficient way? Thank you.
Upvotes: 1
Views: 59
Reputation: 164773
The bottleneck is almost certainly in datetime
conversion.
To improve performance, you can utilise the infer_datetime_format
argument of pd.to_datetime
. As per the docs:
infer_datetime_format : boolean, default False
If True and no format is given, attempt to infer the format of the datetime strings, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.
For example:
data['DATE'] = pd.to_datetime(data['DATE'], infer_datetime_format=True)
data = data[data['DATE'].dt.year == 2017)]
Alternatively, you can specify the precise format:
data['DATE'] = pd.to_datetime(data['DATE'], format='%m/%d/%Y')
data = data[data['DATE'].dt.year == 2017)]
Upvotes: 2