Reputation: 4483
Basically the same as Select first row in each GROUP BY group? only in pandas.
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
                   'B': ['3', '1', '2', '4', '2', '4', '1', '3'],
                   })
Sorting looks promising:
df.sort_values('B')
     A  B
1  foo  1
6  bar  1
2  foo  2
4  bar  2
0  foo  3
7  bar  3
3  foo  4
5  bar  4
But then first() won't give the desired result... df.groupby('A').first()
     B
A
bar  2
foo  3
Upvotes: 36
Views: 53570
Reputation: 1096
Sort first, then keep only the first row per group using .drop_duplicates() with its keep argument set to 'first' (the default). This approach has the benefit that it keeps your original index.
Example:
df.sort_values('B', ascending=True) \
  .drop_duplicates('A', keep='first')
Output:
     A  B
1  foo  1
6  bar  1
You don't need a GroupBy in this case because there is no need to run a calculation on each group (see: the split-apply-combine pattern).
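For example, flipping keep to 'last' keeps the other end of each group, i.e. the row with the largest B after the same ascending sort (a quick sketch):
# Same sort, but keep the last occurrence per value of A,
# i.e. the row with the largest B in each group.
df.sort_values('B', ascending=True).drop_duplicates('A', keep='last')
#      A  B
# 3  foo  4
# 5  bar  4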
Upvotes: 5
Reputation: 121
EdChum's answer may not always work as intended. Instead of first(), use nth(0).
The method first() is affected by this bug that has gone unsolved for some years now. Instead of the expected behaviour, first() returns the first element that is not missing in each column within each group, i.e. it ignores NaN values. For example, say you had a third column with some missing values:
import numpy as np

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': ['1', '2', '2', '4', '1'],
                   'C': [np.nan, 'X', 'Y', 'Y', 'Y']})
     A  B    C
0  foo  1  NaN
1  foo  2    X
2  bar  2    Y
3  bar  4    Y
4  bar  1    Y
Using first() here (after sorting, just like EdChum correctly assessed in their answer) will skip over the missing values (note how it is mixing up values from different rows):
df.sort_values('B').groupby('A').first()
     B  C
A
bar  1  Y
foo  1  X
The correct way to get the full row, including missing values, is to use nth(0), which performs the expected operation:
df.sort_values('B').groupby('A').nth(0)
     B    C
A
bar  1    Y
foo  1  NaN
For completeness, this bug also affects last(); its correct substitute is nth(-1).
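A minimal sketch of the same effect with last(), reusing the df above but sorting in descending order so that the row with the missing C value ends up last in its group (output shapes may differ slightly across pandas versions):
# Sort descending so foo's NaN row is the last one in its group.
df.sort_values('B', ascending=False).groupby('A').last()
# -> for 'foo', C comes back as 'X' (the NaN was skipped)

df.sort_values('B', ascending=False).groupby('A').nth(-1)
# -> for 'foo', C stays NaN, i.e. the actual last row is returned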
Posting this in an answer since it's too long for a comment. Not sure this is within the scope of the question but I think it's relevant to many people looking for this answer (like myself before writing this) and is extremely easy to miss.
Upvotes: 12
Reputation: 30424
Here's an alternative approach using groupby().rank():
df[ df.groupby('A')['B'].rank() == 1 ]
     A  B
1  foo  1
6  bar  1
This gives you the same answer as @EdChum's for the OP's sample dataframe, but could give a different answer if you have any ties during the sort, for example, with data like this:
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                   'B': ['2', '1', '1', '1']})
In this case you have some options using the optional method argument, depending on how you wish to handle sorting ties:
df[ df.groupby('A')['B'].rank(method='average') == 1 ] # the default
df[ df.groupby('A')['B'].rank(method='min') == 1 ]
df[ df.groupby('A')['B'].rank(method='first') == 1 ] # doesn't work, not sure why
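If the failure is indeed down to B holding strings (an assumption on my part, not verified here), a sketch of a workaround is to cast B to a numeric dtype first:
# Assumes the error comes from B being stored as strings:
# convert to numbers, then method='first' breaks ties by position within each group.
df['B'] = pd.to_numeric(df['B'])
df[ df.groupby('A')['B'].rank(method='first') == 1 ]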
Upvotes: 16
Reputation: 394041
Generally, if you want the data sorted within a groupby but the sort key is not one of the columns being grouped on, it's better to sort the df prior to performing the groupby:
In [5]: df.sort_values('B').groupby('A').first()
Out[5]:
     B
A
bar  1
foo  1
Upvotes: 37
Reputation: 32214
The pandas groupby function could be used for what you want, but it's really meant for aggregation, and this is a simple 'take the first' operation. What you actually want is the pandas drop_duplicates function, which keeps the first occurrence by default. Pass what you would usually consider the groupby key as the subset= argument:
df.drop_duplicates(subset='A')
Should do what you want.
Also, df.sort_values('A') does not sort the DataFrame df; it returns a sorted copy. If you want to sort df itself, you have to add the inplace=True parameter:
df.sort_values('A', inplace=True)
Upvotes: 14