subsetting a pandas DataFrame based on time

Question

I have a dataframe with a time column whose values are strings. I would like to subset the dataframe so that only the rows whose time falls within a given window are kept. Currently I am using:

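from datetime import datetime
import numpy as np
import pandas

date_format = '%Y-%m-%d'
window_start = datetime.strptime('2000-01-01', date_format)
window_end = datetime.strptime('2010-12-31', date_format)
# start from an empty frame with the same columns as the original
subs_df = pandas.DataFrame(index=np.arange(0, 0),
                      columns = list(orig_df.columns))

# parse each row's time string and append the row if it is inside the window
for i, row in orig_df.iterrows():
    date = datetime.strptime(row.time, date_format)
    if date >= window_start and date <= window_end:
        subs_df = subs_df.append(row, ignore_index=True)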

This is tremendously slow. I have a feeling that I am doing something fundamentally wrong. What would be a better way of doing this?

Thank you for your time.

jpp · Accepted Answer

Yes, this will be slow. Here are some pointers:

With Pandas, avoid Python-level for loops. You can use Boolean indexing and, since Pandas datetime series are stored internally as integers, this takes advantage of vectorisation.
Avoid Python built-in datetime objects with Pandas. Use pd.Timestamp objects instead. Often, plain date strings are sufficient, since Pandas converts them internally when they are compared against a datetime series.
Avoid, at all costs, using pd.DataFrame.append in a loop. This is an expensive operation, as every call copies the data into a new dataframe. (The method was deprecated in Pandas 1.4 and removed in 2.0.)
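If rows really do have to be gathered one at a time (for example, when they arrive from some other source), a rough sketch of the usual alternative to appending in a loop is to collect them in a plain list and build the dataframe once at the end. The records list and the value column below are made up for illustration; they are not from the question.

```python
import pandas as pd

# Hypothetical rows standing in for data that arrives piece by piece.
records = [
    {"time": "1999-06-15", "value": 1},
    {"time": "2005-03-01", "value": 2},
    {"time": "2012-09-30", "value": 3},
]

# Collect plain dicts in a list instead of growing a DataFrame row by row.
kept = []
for rec in records:
    # Zero-padded ISO dates (YYYY-MM-DD) compare correctly as strings.
    if "2000-01-01" <= rec["time"] <= "2010-12-31":
        kept.append(rec)

# Build the frame in a single step once all rows are collected.
collected_df = pd.DataFrame(kept, columns=["time", "value"])
```

This still loops in Python, so it is only a fallback; when the data is already in a dataframe, a Boolean mask is the better fit.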

Here's an example of what you can do:

# convert series to Pandas datetime
orig_df['time'] = pd.to_datetime(orig_df['time'])

# construct Boolean mask
mask = orig_df['time'].between('2000-01-01', '2010-12-31')

# apply Boolean mask
new_df = orig_df[mask]
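
To try the snippet above end to end, here is a self-contained sketch. The small orig_df and its value column are invented for the demonstration; only the time column and the window dates come from the question. It assumes every string follows the '%Y-%m-%d' format from the question, and it relies on Series.between including both endpoints by default.

```python
import pandas as pd

# Hypothetical stand-in for orig_df: a string 'time' column plus one value column.
orig_df = pd.DataFrame({
    "time": ["1999-12-31", "2000-01-01", "2005-06-15", "2010-12-31", "2011-01-01"],
    "value": [10, 20, 30, 40, 50],
})

# Convert the strings to datetimes; passing the known format avoids guessing.
orig_df["time"] = pd.to_datetime(orig_df["time"], format="%Y-%m-%d")

# Boolean mask: True where the time lies inside the window (both ends included).
mask = orig_df["time"].between("2000-01-01", "2010-12-31")
new_df = orig_df[mask]

# The same mask written as two explicit comparisons against pd.Timestamp objects.
alt_mask = (orig_df["time"] >= pd.Timestamp("2000-01-01")) & (
    orig_df["time"] <= pd.Timestamp("2010-12-31")
)
```

Both masks select the same rows; between is simply the shorter spelling.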
