user3237820

Reputation: 211

Removing duplicate rows from a data frame in R, keeping those with a smaller/larger value

I am trying to remove duplicate rows from an R data frame, with the condition that, among the duplicates, the row with the smaller (or larger; either is fine for this question) value in a certain column is the one that is kept.

I can remove duplicate rows normally (from either side) like this:

df = data.frame( x = c(1,1,2,3,4,5,5,6,1,2,3,3,4,5,6),
             y = c(rnorm(4),NA,rnorm(10)),
             id = c(rep(1,8), rep(2,7)))

splitID <- split(df , df$id)
lapply(splitID, function(x) x[!duplicated(x$x),] )

How can I condition the removal of duplicate rows?

Thanks!

Upvotes: 2

Views: 378

Answers (2)

Martin Morgan

Reputation: 46866

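Use ave() to build a logical index for subsetting your data.frame. Here fun is a placeholder for a function (examples follow) that returns TRUE for the row to keep within each group, so it has to be defined before this runs: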

idx = as.logical(ave(df$y, df$x, df$id, FUN=fun))
df[idx,, drop=FALSE]

Some possible choices for fun include

fun1 = function(x)
    !is.na(x) & !duplicated(x) & (x == min(x, na.rm=TRUE))

fun2 = function(x) {
    res = logical(length(x))
    res[which.min(x)] = TRUE
    res
}
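
As a quick check of how the two candidates behave, here is a minimal sketch on a single group of values. The vector v and the helper fun_max (a which.max() variant for keeping the larger value) are made up for illustration and are not part of the answer above.

```r
# Definitions repeated from the answer so the block runs on its own.
fun1 = function(x)
    !is.na(x) & !duplicated(x) & (x == min(x, na.rm=TRUE))

fun2 = function(x) {
    res = logical(length(x))
    res[which.min(x)] = TRUE
    res
}

# Hypothetical counterpart of fun2 that flags the largest value instead.
fun_max = function(x) {
    res = logical(length(x))
    res[which.max(x)] = TRUE
    res
}

v = c(3, NA, 1, 1, 2)   # made-up values for one group, with an NA and a tie

fun1(v)     # flags only the first of the two tied minima
fun2(v)     # same result here; which.min() ignores NA and takes the first minimum
fun_max(v)  # flags the position of the largest non-NA value

# A group that contains only NA gets no TRUE at all, so its rows are dropped.
fun2(c(NA_real_, NA_real_))
```

Note that fun1 calls min() with na.rm=TRUE, so on an all-NA group it also emits a warning, whereas fun2 stays silent.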

The dplyr version of this might be

library(dplyr)
df %>% group_by(x, id) %>% filter(fun2(y))
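
To see the ave() indexing end to end without random numbers, here is a small base R sketch. The data frame d and its values are invented for illustration; fun2 is copied from above.

```r
# Made-up data: x/id pairs (1,1) and (1,2) each occur twice.
d = data.frame(x  = c(1, 1, 2, 1, 1),
               y  = c(0.5, -0.2, 1.0, 3.0, 2.0),
               id = c(1, 1, 1, 2, 2))

fun2 = function(x) {
    res = logical(length(x))
    res[which.min(x)] = TRUE
    res
}

# ave() applies fun2 to y within each x/id combination and returns the
# results in the original row order (as 0/1, hence as.logical()).
idx = as.logical(ave(d$y, d$x, d$id, FUN=fun2))
res = d[idx,, drop=FALSE]
res   # one row per x/id combination, the one with the smallest y
```

Because the index is in the original row order, the surviving rows keep their relative position in d.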

Upvotes: 2

akrun

Reputation: 887221

We may need to order the rows before applying duplicated(), so that the row to keep comes first within each set of duplicates:

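lapply(splitID, function(x) {
    x <- x[order(x$x, x$y), ]   # sort first, so the index below matches the rows
    x[!duplicated(x$x), ]       # keep the first (smallest y) row for each x
})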

For the reverse, i.e. keeping the larger values, call order() with decreasing = TRUE.
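
Both directions can be sketched with one helper. The name keep_extreme, its largest argument and the data frame d are made up for illustration; this is only a sketch of the order-then-duplicated idea, assuming there are no NA values in the columns involved.

```r
# Hypothetical helper: sort by x and y, then keep the first row for each x.
# With largest = TRUE the sort is decreasing, so the largest y comes first.
keep_extreme = function(d, largest = FALSE) {
    d = d[order(d$x, d$y, decreasing = largest), ]
    d[!duplicated(d$x), ]
}

# Made-up data without random numbers or NA values.
d = data.frame(x  = c(1, 1, 2, 3, 3),
               y  = c(0.5, -0.2, 1.0, 1.5, 2.0),
               id = c(1, 1, 1, 2, 2))

small = lapply(split(d, d$id), keep_extreme)                  # smaller y kept
large = lapply(split(d, d$id), keep_extreme, largest = TRUE)  # larger y kept
small
large
```

Since decreasing = TRUE applies to both sort keys, the rows in large come out with x in descending order; that does not change which rows are kept.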

Upvotes: 1
