vvv
vvv

Reputation: 337

Find first time a value occurs in the dataframe

I have a dataframe with year-quarter (e.g. 2015-Q4), the customer_ID, and amount booked, and many other columns irrelevant for now. I want to create a column that has the first time each customer made a booking. I tried this:

alldata.sort_values(by=['Total_Apps_Reseller_Bookings_USD', 'Year_Quarter'], 
                    ascending=[1, 1], 
                    inplace=True)
first_q = alldata[['Customer_ID', 'Year_Quarter']].groupby(by='Customer_ID').first()

but I am not sure it worked.

Also, I then want to have another column that tells me how many quarters after the first booking that booking was made. I failed using replace and dictionary, so I used a merge. I create an numeric id for each quarter of booking, and first quarter from above, and then subtract the two:

q_booking_num = pd.DataFrame({'Year_Quarter': x, 'First_Quarter_id': np.arange(28)})

alldata = pd.merge(alldata, q_booking_num, on='Year_Quarter', how='outer')
q_first_num = pd.DataFrame({'First_Quarter': x, 'First_Quarter_id': np.arange(28)})
alldata = pd.merge(alldata, q_first_num, on='First_Quarter', how='outer')

this doesn't seem to have worked at all as I see 'first quarters' that are after some bookings that were already made.

Upvotes: 4

Views: 2413

Answers (2)

Adam
Adam

Reputation: 56

For part 1:

I think you need to sort a little differently to get your desired outcome:

alldata.sort_values(by=['Customer_ID', 'Year_Quarter', 
                        'Total_Apps_Reseller_Bookings_USD'], 
                    ascending=[1, 1],inplace=True)
first_q = alldata[['Customer_ID','Year_Quarter']].groupby(by='Customer_ID').head(1)

For part 2:

Continuing off of part 1, you can merge the values back on to the original dataframe. At that point, you can write a custom function to subtract your date strings and then apply it to each row.

Something like:

def qt_sub(val, first):
    year_dif = val[0:4] - first[0:4]
    qt_dif = val[6] - first[6]
    return 4 * int(year_dif) + int(qt_dif)

alldata['diff_from_first'] = alldata.apply(lambda x: qt_sub(x['Year_Quarter'], 
                                                            x['First_Sale']),
                                           axis=1)

Upvotes: 0

Alexander
Alexander

Reputation: 109526

You need to specify which column to use for taking the first value:

first_q = (alldata[['Customer_ID','Year_Quarter']]
           .groupby(by='Customer_ID')
           .Year_Quarter
           .first()
          )

Here is some sample data for three customers:

df = pd.DataFrame({'customer_ID': [1, 
                                   2, 2, 
                                   3, 3, 3], 
                   'Year_Quarter': ['2010-Q1', 
                                    '2010-Q1', '2011-Q1', 
                                    '2010-Q1', '2011-Q1', '2012-Q1'], 
                   'Total_Apps_Reseller_Bookings_USD': [1, 
                                                        2, 3, 
                                                        4, 5, 6]})

Below, I convert text quarters (e.g. '2010-Q1') to a numeric equivalent by taking the int value of the first for characters (df.Year_Quarter.str[:4].astype(int)). I then multiply it by four and add the value of the quarter. This value is only used for differencing to determine the total number of quarters since the first order.

Next, I use transform on the groupby to take the min value of these quarters we just calculated. Using transform keeps this value in the same shape as the original dataframe.

I then calcualte the quarters_since_first_order as the difference between the quarter and the first quarter.

df['quarters'] = df.Year_Quarter.str[:4].astype(int) * 4 + df.Year_Quarter.str[-1].astype(int)
first_order_quarter_no = df.groupby('customer_ID').quarters.transform(min)
df['quarters_since_first_order'] = quarters - first_order_quarter_no
del df['quarters']  # Clean-up.

>>> df
   Total_Apps_Reseller_Bookings_USD Year_Quarter  customer_ID  quarters_since_first_order
0                                 1      2010-Q1            1                           0
1                                 2      2010-Q1            2                           0
2                                 3      2011-Q1            2                           4
3                                 4      2010-Q1            3                           0
4                                 5      2011-Q1            3                           4
5                                 6      2012-Q1            3                           8

Upvotes: 1

Related Questions