Reputation: 3779
I have a pandas dataframe with columns (among others) user_id
and start_time
. I want to efficiently and readably find all rows associated with each user's max start_time
.
For example, if this were my data:
user_id start_time A B C
1 37 a b c
1 45 d e f
1 45 g h i
2 58 j k l
2 17 m n o
2 58 p q r
3 2 s t u
then I would expect to find
user_id start_time A B C
1 45 d e f
1 45 g h i
2 58 j k l
2 58 p q r
3 2 s t u
I've been coming up with solutions a bit like Conditional selection of data in a pandas DataFrame, but that finds the user_id
with the latest start time
, not the selection of the table for per-user max start_time
's.
Of course, it's easy if I iterated the DataFrame by hand, but that is inefficient.
Thanks for any pointers.
For convenience to future readers, generate the dataframe thus:
columns = ['user_id', 'start_time', 'A', 'B', 'C']
LoL = [
[1, 37, 'a', 'b', 'c'],
[1, 45, 'd', 'e', 'f'],
[1, 45, 'g', 'h', 'i'],
[2, 58, 'j', 'k', 'l'],
[2, 17, 'm', 'n', 'o'],
[2, 58, 'p', 'q', 'r'],
[3, 2, 's', 't', 'u']]
pd.DataFrame = (LoL, columns=columns)
Upvotes: 3
Views: 929
Reputation: 10302
Alternatively you can do this:
df[df.start_time == df.groupby('user_id')['start_time'].transform(max)]
Which yields:
user_id start_time A B C
1 1 45 d e f
2 1 45 g h i
3 2 58 j k l
5 2 58 p q r
6 3 2 s t u
Upvotes: 1
Reputation: 394061
You can groupby the user_id column and then call apply
and pass a lambda which filters the results where the start time is equal to the max value, we want to generate a boolean index from this. We can then call reset_index
but due to the way the groupby was filtered we will get an error with duplicate columns so we have to drop this duplicate column:
In [66]:
gp = df.groupby('user_id')
gp.apply(lambda x: x[x['start_time'] == x['start_time'].max()]).reset_index(drop=True)
Out[66]:
user_id start_time A B C
0 1 45 d e f
1 1 45 g g i
2 2 58 j k l
3 2 58 p q r
4 3 2 s t u
if we didn't call reset_index
you get duplicated values:
In [67]:
gp.apply(lambda x: x[x['start_time'] == x['start_time'].max()])
Out[67]:
user_id start_time A B C
user_id
1 1 1 45 d e f
2 1 45 g g i
2 3 2 58 j k l
5 2 58 p q r
3 6 3 2 s t u
The inner boolean condition produces a boolean mask on the multi-index, this is then needed to pass to the lambda to produce the above:
In [68]:
gp.apply(lambda x: x['start_time'] == x['start_time'].max())
Out[68]:
user_id
1 0 False
1 True
2 True
2 3 True
4 False
5 True
3 6 True
Name: start_time, dtype: bool
Upvotes: 3