Reputation: 133
I have a dataframe that has a time column where the values are strings. I would like to subset the dataframe so that only the values that sit within a window are in the subset. Currently I am using
from datetime import datetime
import numpy as np
import pandas

date_format = '%Y-%m-%d'
window_start = datetime.strptime('2000-01-01', date_format)
window_end = datetime.strptime('2010-12-31', date_format)
subs_df = pandas.DataFrame(index=np.arange(0, 0),
                           columns=list(orig_df.columns))
for i, row in orig_df.iterrows():
    date = datetime.strptime(row.time, date_format)
    if date >= window_start and date <= window_end:
        subs_df = subs_df.append(row, ignore_index=True)
This is tremendously slow. I have a feeling that I am doing something fundamentally wrong. What would be a better way of doing this?
Thank you for your time.
Upvotes: 2
Views: 1920
Reputation: 164623
Yes, this will be slow. Here are some pointers:
- Avoid for loops. You can use Boolean indexing and, since Pandas datetime series are stored internally as integers, this takes advantage of vectorisation.
- Avoid datetime objects with Pandas. Use pd.Timestamp objects instead. Often, strings are sufficient as this conversion occurs internally.
- Avoid pd.DataFrame.append in a loop. This is an expensive operation as it involves unnecessary copying of data.
Here's an example of what you can do:
# convert series to Pandas datetime
orig_df['time'] = pd.to_datetime(orig_df['time'])
# construct Boolean mask
mask = orig_df['time'].between('2000-01-01', '2010-12-31')
# apply Boolean mask
new_df = orig_df[mask]
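For anyone who wants to try this end to end, here is a minimal self-contained sketch. The sample rows and the value column are made up for illustration and are not from the question; only the time column name, the date format and the window bounds come from the original post. It uses explicit pd.Timestamp bounds, though plain strings as above should work too.

```python
import pandas as pd

# Hypothetical sample data standing in for orig_df; 'time' holds date strings
orig_df = pd.DataFrame({
    "time": ["1999-12-31", "2000-01-01", "2005-06-15", "2010-12-31", "2011-01-01"],
    "value": [1, 2, 3, 4, 5],
})

# Parse the strings once, using the explicit format from the question
orig_df["time"] = pd.to_datetime(orig_df["time"], format="%Y-%m-%d")

# between() includes both endpoints by default, matching the >= / <= test
window_start = pd.Timestamp("2000-01-01")
window_end = pd.Timestamp("2010-12-31")
mask = orig_df["time"].between(window_start, window_end)

# copy() so that later changes to subs_df do not affect orig_df
subs_df = orig_df[mask].copy()
print(subs_df)
```

Note that the bounds are midnight timestamps. If the column carried time-of-day values, rows later than midnight on 2010-12-31 would fall outside the window, so you may want to move the upper bound in that case.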
Upvotes: 6