Phillip Cunningham
Phillip Cunningham

Reputation: 141

Group data by one column and select first occurences from two other columns

I need to .groupby() using customer, and then add a column for the date in which the customer made his/her first purchase, and add another column for the corresponding purchase amount.

Here is my code. I am doing the first part wrong and don't know how to do the second. I've tried .loc and .idxmin ....

mydata = [{'amount': 3200, 'close_date':'2013-03-31', 'customer': 'Customer 1',},
          {'amount': 1430, 'close_date':'2013-11-30', 'customer': 'Customer 1',},
          {'amount': 4320, 'close_date':'2014-03-31', 'customer': 'Customer 2',},
          {'amount': 2340, 'close_date':'2015-05-18', 'customer': 'Customer 2',},
          {'amount': 4320, 'close_date':'2015-06-29', 'customer': 'Customer 2',},]

df = pd.DataFrame(mydata)
df.close_date = pd.to_datetime(df.close_date)
df['first_date'] = df.groupby('customer')['close_date'].min().apply(lambda x: x.strftime('%Y-%m'))

Upvotes: 1

Views: 182

Answers (2)

Stefan
Stefan

Reputation: 42875

If you sort your data by close_date, you can do as follows:

df.sort_values('close_date').groupby(['customer'])['close_date', 'amount'].first()

           close_date  amount
customer                     
Customer 1 2013-03-31    3200
Customer 2 2014-03-31    4320

.sort_values() has been added in 0.17, used to be sort() (see docs).

Upvotes: 3

TomAugspurger
TomAugspurger

Reputation: 28936

Two steps.

First the day of first purchase:

In [34]: first = df.groupby('customer').close_date.min()

In [35]: first
Out[35]:
customer
Customer 1    2013-03-31
Customer 2    2014-03-31
Name: close_date, dtype: object

We'll use first as an indexer,

In [36]: idx = pd.MultiIndex.from_tuples(list(first.iteritems()), names=['customer', 'close_date'])

In [37]: idx
Out[37]:
MultiIndex(levels=[['Customer 1', 'Customer 2'], ['2013-03-31', '2014-03-31']],
           labels=[[0, 1], [0, 1]])

For a DataFrame with those two levels

In [38]: df2 = df.set_index(['customer', 'close_date'])

In [39]: df2.loc[idx]
Out[39]:                       
                         amount
customer   close_date
Customer 1 2013-03-31    3200
Customer 2 2014-03-31    4320

This is a series, you can use .unstack() to get back a DataFrame.

Upvotes: 0

Related Questions