Reputation: 1102
I am preparing some data for cohort analysis. The information I have is similar to the fake data set that can be generated with the code below:
import random
import pandas as pd
from pandas import DataFrame

# prepare some fake data to build frames
subscription_prices = [x - 0.05 for x in range(100, 500, 25)]
companies = ['initech', 'ingen', 'weyland', 'tyrell']
starting_periods = ['2014-12-10', '2015-1-15', '2014-11-20', '2015-2-9']

# use the lists from above to create a fake dataset
pieces = []
for company, period in zip(companies, starting_periods):
    data = {
        'company': company,
        'revenue': random.choice(subscription_prices),
        'invoice_date': pd.date_range(period, periods=12, freq='31D')
    }
    frame = DataFrame(data)
    pieces.append(frame)
df = pd.concat(pieces, ignore_index=True)
I need to normalize the invoice date to a monthly period. For a number of reasons, it's best to shift all invoice_date values to the end of the month. I used this method:
from pandas.tseries.offsets import MonthEnd
df['rev_period'] = df['invoice_date'].apply(lambda x: MonthEnd(normalize=True).rollforward(x))
However, even at only a million rows (which is the size of my actual data set) this becomes painfully slow:
In [11]: %time df['invoice_date'].apply(lambda x: MonthEnd(normalize=True).rollforward(x))
CPU times: user 3min 11s, sys: 1.44 s, total: 3min 12s
Wall time: 3min 17s
The great part about this method of date offsetting with pandas is that if the invoice_date happens to fall on the last day of the month, that date stays on the last day of the month. The other nice thing is that it keeps the dtype as datetime, whereas df['invoice_date'].apply(lambda x: x.strftime('%Y-%m')) is faster but converts the values to str.
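To make that concrete, here is a minimal standalone sketch of the rollforward behavior (illustrative dates only, not my real data):

import pandas as pd
from pandas.tseries.offsets import MonthEnd

# a mid-month date rolls forward to the month end
print(MonthEnd(normalize=True).rollforward(pd.Timestamp('2015-01-15')))
# 2015-01-31 00:00:00

# a date already on the last day of the month is left unchanged
print(MonthEnd(normalize=True).rollforward(pd.Timestamp('2015-01-31')))
# 2015-01-31 00:00:00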
Is there a vectorized way of doing this? I tried MonthEnd(normalize=True).rollforward(df['invoice_date']) but got the error TypeError: Cannot convert input to Timestamp.
Upvotes: 4
Views: 1031
Reputation: 2238
Yes, there is:
df['rev_period'] = df['invoice_date'] + pd.offsets.MonthEnd(0)
Should be at least an order of magnitude faster: the offset is added to the whole datetime64 column at once instead of being applied row by row with apply, and MonthEnd(0) leaves dates that already fall on the month end unchanged.
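A minimal sketch on a throwaway Series (illustrative dates, not the df above) shows the same end-of-month semantics as rollforward, with the dtype preserved:

import pandas as pd

s = pd.Series(pd.to_datetime(['2015-01-15', '2015-01-31']))
print(s + pd.offsets.MonthEnd(0))
# 0   2015-01-31
# 1   2015-01-31
# dtype: datetime64[ns]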
Upvotes: 4