Reputation: 25
I am doing an external exercise where I have a set of data of customers' purchases.
I have the following columns: customer_id
, date
, gender
, value
(purchase value). One part of the exercise is to create a new column named most_recent_order_date
. How should I go about accomplishing this?
I tried
df['most_recent_order_date']=df.sort_values('customer_id',ascending=False)['date']
but this only returns the dates of all purchases in ascending order. I need it to be customer_id
specific since a customer_id
might have multiple purchases.
Another part of the exercise is to create a order_count
column which is what the last column is.
data= pd.read_csv('screening_exercise_orders_v201810.csv')
df=pd.DataFrame(data)
df['most_recent_order_date']= 'default value'
df['order_count']= 'default value'
df['date'] = pd.to_datetime(df['date'])
df['most_recent_order_date']=df.sort_values('customer_id',ascending=False)['date']
df['order_count']= df.groupby(['customer_id']).transform('count')
df.head(10)
I expect something like:
0 1000 0 2017-01-01 00:11:31 198.50 1 2017-02-10 00:11: 1
1 1001 0 2017-01-01 00:29:56 338.00 1 2017-11-01 00:29:56 1
2 1002 1 2017-01-01 01:30:31 733.00 1 2017-06-11 01:30:31 3
3 1003 1 2017-01-01 01:34:22 772.00 1 2017-05-14 01:34:22 4
4 1004 0 2017-01-01 03:11:54 508.00 1 2017-01-01 03:11:54 1
But what I actually get is:
0 1000 0 2017-01-01 00:11:31 198.50 1 2017-01-01 00:11:31 1
1 1001 0 2017-01-01 00:29:56 338.00 1 2017-01-01 00:29:56 1
2 1002 1 2017-01-01 01:30:31 733.00 1 2017-01-01 01:30:31 3
3 1003 1 2017-01-01 01:34:22 772.00 1 2017-01-01 01:34:22 4
4 1004 0 2017-01-01 03:11:54 508.00 1 2017-01-01 03:11:54 1
Upvotes: 1
Views: 531
Reputation: 18647
For most recent date, use groupby.transform
with max
:
df['date'] = pd.to_datetime(df['date'])
df['most_recent_date'] = df.groupby(['customer_id'])['date'].transform('max')
For count use groupby.cumcount
:
df['order_count'] = df.groupby(['customer_id']).cumcount().add(1)
Upvotes: 1