Reputation: 65550
I have two tables in pandas, a user
table and a history
table - the latter is essentially a log of all actions taken by users.
User table:
| user_id | source
0 | 1 | blog
1 | 2 | blog
2 | 3 | organic
History table:
| user_id | action_type | t_actioned
0 | 1 | 332 | 2018-08-04 12:35:23
1 | 1 | 453 | 2018-08-04 12:36:23
2 | 1 | 332 | 2018-08-04 12:38:23
3 | 2 | 452 | 2018-08-04 12:40:23
4 | 3 | 523 | 2018-08-04 12:41:23
5 | 2 | 452 | 2018-08-04 12:41:43
I'd like to find the timestamp of the latest action taken by each user, from the history
table, and add it to the user
table as a new column.
How can I do this?
Looking at this answer, I think it's something like:
# Get the latest action by user
first_action = history.group_by('user_id').agg(lambda df: df.sort('t_actioned')[-1:].values[0])
user.first_action = # join with first_action somehow?
But the agg
query doesn't work for me and I'm not sure what to do next.
Upvotes: 2
Views: 189
Reputation: 294358
You can take advantage of the fact that when construction a dictionary with multiple duplicate keys, you'll only keep the last one.
m = dict(history.sort_values('t_actioned').pipe(
lambda d: zip(d.user_id, d.t_actioned)))
user.assign(latest=user.user_id.map(m))
user_id source latest
0 1 blog 2018-08-04 12:38:23
1 2 blog 2018-08-04 12:41:43
2 3 organic 2018-08-04 12:41:23
The short version considering data is already sorted by 't_actioned'
user.assign(latest=user.user_id.map(dict(zip(history.user_id, history.t_actioned))))
Upvotes: 3
Reputation: 164713
First sort, drop duplicates and create a series from your history dataframe:
s = history.sort_values('t_actioned', ascending=False)\
.drop_duplicates('user_id')\
.set_index('user_id')['action_type']
Then map this to your user dataframe:
user['action_type'] = user['user_id'].map(s)
As pointed out in comments, if your log is already sorted, you can avoid sort_values
and use drop_duplicates('user_id', keep='last')
.
Upvotes: 4
Reputation: 6659
Another way:
history = history.groupby(['user_id']).apply(lambda x: x.sort_values('t_actioned', ascending = False))
history = history.drop_duplicates(subset = ['user_id'], keep = 'first')
user = pd.merge(user, history[['t_actioned']], on = 'user_id', how = 'left')
Output:
user_id source t_actioned
0 1 blog 2018-08-04 12:38:23
1 2 blog 2018-08-04 12:41:43
2 3 organic 2018-08-04 12:41:23
Upvotes: 0