agatha
agatha

Reputation: 1513

Remove duplicates based on 2nd column condition

I am trying to remove duplicate rows from a data frame based on the max value on a different column

So, for the data frame:

df<-data.frame (rbind(c("a",2,3),c("a",3,4),c("a",3,5),c("b",1,3),c("b",2,6),c("r",4,5))
  colnames(df)<-c("id","val1","val2")

id val1 val2

  a    2    3

  a    3    4

  a    3    5

  b    1    3

  b    2    6

  r    4    5

I would like to keep remove all duplicates by id with the condition that for the corresponding rows they do not have the maximum value for val2.

Thus the data frame should become:

  a    3    5

  b    2    6

  r    4    5

-> remove all a duplicates but keep row with the max value for df$val2 for subset(df, df$id=="a")

Upvotes: 19

Views: 15821

Answers (4)

JLCEBRIAN
JLCEBRIAN

Reputation: 11

Another one

df %>% group_by(id) %>%
         slice(which.max(val2))

  id   val1  val2
   a    3     5
   b    2     6
   r    4     5

Upvotes: 1

Rich Scriven
Rich Scriven

Reputation: 99321

Here's how I hope your data is really set up

df <- data.frame (id = c(rep("a", 3), rep("b", 2), "r"),
                  val1 = c(2, 3, 3, 1, 2, 4), val2 = c(3, 4, 5, 3, 6, 5))

You could do a split-unsplit

> unsplit(lapply(split(df, df$id), function(x) {
      if(nrow(x) > 1) {
          x[duplicated(x$id) & x$val2 == max(x$val2),]
      } else {
          x
      }
  }), levels(df$id))
#   id val1 val2
# 3  a    3    5
# 5  b    2    6
# 6  r    4    5

You can also use Reduce(rbind, ...) or do.call(rbind, ...) in place of unsplit

Upvotes: 2

akrun
akrun

Reputation: 886938

Using base R. Here, the columns are factors. Make sure to convert it to numeric

 df$val2 <- as.numeric(as.character(df$val2))
 df[with(df, ave(val2, id, FUN=max)==val2),]
 #  id val1 val2
 #3  a    3    5
 #5  b    2    6
 #6  r    4    5

Or using dplyr

 library(dplyr)
 df %>% 
    group_by(id) %>% 
    filter(val2==max(val2))
 #   id val1 val2
 #1  a    3    5
 #2  b    2    6
 #3  r    4    5

Upvotes: 19

David Arenburg
David Arenburg

Reputation: 92282

One possible way is to use data.table

library(data.table)
setDT(df)[, .SD[which.max(val2)], by = id]
##    id val1 val2
## 1:  a    3    5
## 2:  b    2    6
## 3:  r    4    5

Upvotes: 8

Related Questions