Reputation: 4483
Basically the same as Select first row in each GROUP BY group? only in pandas.
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
                   'B': ['3', '1', '2', '4', '2', '4', '1', '3'],
                   })
Sorting looks promising:
df.sort_values('B')
     A  B
1  foo  1
6  bar  1
2  foo  2
4  bar  2
0  foo  3
7  bar  3
3  foo  4
5  bar  4
But then first() won't give the desired result... df.groupby('A').first()
     B
A
bar  2
foo  3
Upvotes: 36
Views: 53570
Reputation: 1096
Sort first, then keep only the first row per group using .drop_duplicates() with its keep argument set to 'first' (the default). This approach has the benefit that it keeps your original index.
Example:
df.sort_values('B', ascending=True) \
  .drop_duplicates('A', keep='first')
Output:
     A  B
1  foo  1
6  bar  1
You don't need a GroupBy in this case because there is no need to run a calculation on each group (see: the split-apply-combine pattern).
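For example, flipping keep to 'last' keeps the other end of each group, i.e. the row with the largest B after the same ascending sort (a quick sketch):
# Same sort, but keep the last occurrence per value of A,
# i.e. the row with the largest B in each group.
df.sort_values('B', ascending=True).drop_duplicates('A', keep='last')
#      A  B
# 3  foo  4
# 5  bar  4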
Upvotes: 5
Reputation: 121
EdChum's answer may not always work as intended. Instead of first(), use nth(0).
The method first() is affected by this bug that has gone unsolved for some years now. Instead of the expected behaviour, first() returns the first element that is not missing in each column within each group, i.e. it ignores NaN values. For example, say you had a third column with some missing values:
import numpy as np

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': ['1', '2', '2', '4', '1'],
                   'C': [np.nan, 'X', 'Y', 'Y', 'Y']})
     A  B    C
0  foo  1  NaN
1  foo  2    X
2  bar  2    Y
3  bar  4    Y
4  bar  1    Y
Using first() here (after sorting, just like EdChum correctly assessed in their answer) will skip over the missing values (note how it is mixing up values from different rows):
df.sort_values('B').groupby('A').first()
     B  C
A
bar  1  Y
foo  1  X
The correct way to get the full row, including missing values, is to use nth(0), which performs the expected operation:
df.sort_values('B').groupby('A').nth(0)
     B    C
A
bar  1    Y
foo  1  NaN
For completeness, this bug also affects last(); its correct substitute is nth(-1).
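A minimal sketch of the same effect with last(), reusing the df above but sorting in descending order so that the row with the missing C value ends up last in its group (output shapes may differ slightly across pandas versions):
# Sort descending so foo's NaN row is the last one in its group.
df.sort_values('B', ascending=False).groupby('A').last()
# -> for 'foo', C comes back as 'X' (the NaN was skipped)

df.sort_values('B', ascending=False).groupby('A').nth(-1)
# -> for 'foo', C stays NaN, i.e. the actual last row is returned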
Posting this in an answer since it's too long for a comment. Not sure this is within the scope of the question but I think it's relevant to many people looking for this answer (like myself before writing this) and is extremely easy to miss.
Upvotes: 12
Reputation: 30424
Here's an alternative approach using groupby().rank():
df[ df.groupby('A')['B'].rank() == 1 ]
     A  B
1  foo  1
6  bar  1
This gives you the same answer as @EdChum's for the OP's sample dataframe, but could give a different answer if you have any ties during the sort, for example, with data like this:
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                   'B': ['2', '1', '1', '1']})
In this case you have some options using the optional method argument, depending on how you wish to handle sorting ties:
df[ df.groupby('A')['B'].rank(method='average') == 1 ] # the default
df[ df.groupby('A')['B'].rank(method='min') == 1 ]
df[ df.groupby('A')['B'].rank(method='first') == 1 ] # doesn't work, not sure why
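If the failure is indeed down to B holding strings (an assumption on my part, not verified here), a sketch of a workaround is to cast B to a numeric dtype first:
# Assumes the error comes from B being stored as strings:
# convert to numbers, then method='first' breaks ties by position within each group.
df['B'] = pd.to_numeric(df['B'])
df[ df.groupby('A')['B'].rank(method='first') == 1 ]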
Upvotes: 16
Reputation: 394041
Generally, if you want the data sorted within a groupby but the sort key is not one of the columns being grouped on, it's better to sort the df prior to performing the groupby:
In [5]: df.sort_values('B').groupby('A').first()
Out[5]:
     B
A
bar  1
foo  1
Upvotes: 37
Reputation: 32214
The pandas groupby function could be used for what you want, but it's really meant for aggregation, and this is a simple 'take the first' operation. What you actually want is the pandas drop_duplicates function, which keeps the first occurrence by default. Pass what you would usually consider the groupby key as the subset= argument:
df.drop_duplicates(subset='A')
Should do what you want.
Also, df.sort_values('A') does not sort the DataFrame df; it returns a sorted copy. If you want to sort df itself, you have to add the inplace=True parameter:
df.sort_values('A', inplace=True)
Upvotes: 14