Reputation: 337
I have a dataframe with year-quarter (e.g. 2015-Q4), the customer_ID, and amount booked, and many other columns irrelevant for now. I want to create a column that has the first time each customer made a booking. I tried this:
alldata.sort_values(by=['Total_Apps_Reseller_Bookings_USD', 'Year_Quarter'],
ascending=[1, 1],
inplace=True)
first_q = alldata[['Customer_ID', 'Year_Quarter']].groupby(by='Customer_ID').first()
but I am not sure it worked.
Also, I then want to have another column that tells me how many quarters after the first booking that booking was made. I failed using replace and dictionary, so I used a merge. I create an numeric id for each quarter of booking, and first quarter from above, and then subtract the two:
q_booking_num = pd.DataFrame({'Year_Quarter': x, 'First_Quarter_id': np.arange(28)})
alldata = pd.merge(alldata, q_booking_num, on='Year_Quarter', how='outer')
q_first_num = pd.DataFrame({'First_Quarter': x, 'First_Quarter_id': np.arange(28)})
alldata = pd.merge(alldata, q_first_num, on='First_Quarter', how='outer')
this doesn't seem to have worked at all as I see 'first quarters' that are after some bookings that were already made.
Upvotes: 4
Views: 2413
Reputation: 56
For part 1:
I think you need to sort a little differently to get your desired outcome:
alldata.sort_values(by=['Customer_ID', 'Year_Quarter',
'Total_Apps_Reseller_Bookings_USD'],
ascending=[1, 1],inplace=True)
first_q = alldata[['Customer_ID','Year_Quarter']].groupby(by='Customer_ID').head(1)
For part 2:
Continuing off of part 1, you can merge the values back on to the original dataframe. At that point, you can write a custom function to subtract your date strings and then apply it to each row.
Something like:
def qt_sub(val, first):
year_dif = val[0:4] - first[0:4]
qt_dif = val[6] - first[6]
return 4 * int(year_dif) + int(qt_dif)
alldata['diff_from_first'] = alldata.apply(lambda x: qt_sub(x['Year_Quarter'],
x['First_Sale']),
axis=1)
Upvotes: 0
Reputation: 109526
You need to specify which column to use for taking the first value:
first_q = (alldata[['Customer_ID','Year_Quarter']]
.groupby(by='Customer_ID')
.Year_Quarter
.first()
)
Here is some sample data for three customers:
df = pd.DataFrame({'customer_ID': [1,
2, 2,
3, 3, 3],
'Year_Quarter': ['2010-Q1',
'2010-Q1', '2011-Q1',
'2010-Q1', '2011-Q1', '2012-Q1'],
'Total_Apps_Reseller_Bookings_USD': [1,
2, 3,
4, 5, 6]})
Below, I convert text quarters (e.g. '2010-Q1') to a numeric equivalent by taking the int value of the first for characters (df.Year_Quarter.str[:4].astype(int)
). I then multiply it by four and add the value of the quarter. This value is only used for differencing to determine the total number of quarters since the first order.
Next, I use transform
on the groupby
to take the min value of these quarters we just calculated. Using transform
keeps this value in the same shape as the original dataframe.
I then calcualte the quarters_since_first_order
as the difference between the quarter and the first quarter.
df['quarters'] = df.Year_Quarter.str[:4].astype(int) * 4 + df.Year_Quarter.str[-1].astype(int)
first_order_quarter_no = df.groupby('customer_ID').quarters.transform(min)
df['quarters_since_first_order'] = quarters - first_order_quarter_no
del df['quarters'] # Clean-up.
>>> df
Total_Apps_Reseller_Bookings_USD Year_Quarter customer_ID quarters_since_first_order
0 1 2010-Q1 1 0
1 2 2010-Q1 2 0
2 3 2011-Q1 2 4
3 4 2010-Q1 3 0
4 5 2011-Q1 3 4
5 6 2012-Q1 3 8
Upvotes: 1