Keith Allen
Keith Allen

Reputation: 153

select multiple rows from a dataframe with a value equal to highest value by group

I have a data frame that looks like this:

query <- c('a','a','a','b','b','b','c','c','c')
hit <- c(1,2,3,4,5,6,7,8,9)
score <- c(97,97,96,98,97,96,99,99,98)
df <- data.frame(query,hit,score)
df

  query hit score
1     a   1    97
2     a   2    97
3     a   3    96
4     b   4    98
5     b   5    97
6     b   6    96
7     c   7    99
8     c   8    99
9     c   9    98

I want to group on the first column, and select all rows with a score equal to the highest score for that group. About the closest I can figure out is to use top_n like this:

df %>%
+ group_by(query) %>%
+ top_n(2,score)

A tibble: 6 x 3

Groups: query [3]

   query   hit score
  <fctr> <dbl> <dbl>
1      a     1    97
2      a     2    97
3      b     4    98
4      b     5    97
5      c     7    99
6      c     8    99

But obviously all that's doing is giving me the top two (or whatever I specify). The result I want to end up with would look more like this:

   query   hit score
  <fctr> <dbl> <dbl>
1      a     1    97
2      a     2    97
3      b     4    98
5      c     7    99
6      c     8    99

As usual, I assume I'm missing something pretty simple.

Upvotes: 1

Views: 657

Answers (2)

Jake Thompson
Jake Thompson

Reputation: 2843

Your syntax is basically correct, just specify n = 1 instead of n = 2 in top_n.

query <- c('a','a','a','b','b','b','c','c','c')
hit <- c(1,2,3,4,5,6,7,8,9)
score <- c(97,97,96,98,97,96,99,99,98)
df <- data.frame(query,hit,score)

df %>%
  group_by(query) %>%
  top_n(n = 1, wt = score)
#> # A tibble: 5 x 3
#> # Groups:   query [3]
#>    query   hit score
#>   <fctr> <dbl> <dbl>
#> 1      a     1    97
#> 2      a     2    97
#> 3      b     4    98
#> 4      c     7    99
#> 5      c     8    99

When using top_n, if there is a tie, all observations with that score will be returned. So you can specify that you want the 1 highest score (n = 1), and then all observations with that score, within each group, will be returned.

Upvotes: 4

Rich Scriven
Rich Scriven

Reputation: 99321

In dplyr, just filter on score == max(score):

group_by(df, query) %>%
    filter(score == max(score))
# A tibble: 5 x 3
# Groups:   query [3]
#    query   hit score
#   <fctr> <dbl> <dbl>
# 1      a     1    97
# 2      a     2    97
# 3      b     4    98
# 4      c     7    99
# 5      c     8    99

You could also easily do this in base R, with ave():

df[with(df, ave(score, query, FUN = max) == score), ]
#   query hit score
# 1     a   1    97
# 2     a   2    97
# 4     b   4    98
# 7     c   7    99
# 8     c   8    99

Upvotes: 4

Related Questions