Reputation: 119

Removing duplicate rows with condition about other column in R

I need a help with sort and delete rows in dataframe.

Example data:

data <- data.frame(
  CODE1 = c(111,111,111,666,666,777,777), 
  CODE2 = c(100,101,102,102,103,103,104),
  VALUE = c(3,5,5,3,2,2,1))

I want to remove duplicate rows from column CODE1 and the smallest value from column VALUE.

Result:

output <- data.frame(
  CODE1 = c(111,111,666,777), 
  CODE2 = c(101,102,102,103),
  VALUE = c(5,5,3,2))

I used code like this:

 data <- data[order(data$CODE1, -data$VALUE),]
 data <- data[!duplicated(data$CODE1),]

But this causes a random element to be deleted if the largest VALUE occurs 2 or more times. Does anyone know what can I do?

Upvotes: 1

Answers (2)

r2evans

Reputation: 160827

I don't think you can have "remove duplicate rows from column CODE1" while at the same time only dropping the smallest VALUE.

Try these.

All of them include a check to determine if there is only one value in the group; if a group had just 1, then it would always be filtered out (since min(value) == value, using the "min" approach).

base R

do.call(rbind, by(dat, dat$code1, function(z) z[nrow(z) == 1L | rank(z$value) > 1,]))
#       code1 code2 value
# 111.2   111   101     5
# 111.3   111   102     5
# 666     666   102     3
# 777     777   103     2

Inspired by ThomasIsCoding's use of subset/ave, here's an alternative:

subset(dat,
      ave(value, code1, FUN = function(z) if (length(z) > 1) rank(z) else 2) > 1)

dplyr

library(dplyr)
dat %>%
  group_by(code1) %>%
  filter(n() == 1L | dense_rank(value) > 1) %>%
  ungroup()
# # A tibble: 4 x 3
#   code1 code2 value
#   <chr> <chr> <dbl>
# 1 111   101       5
# 2 111   102       5
# 3 666   102       3
# 4 777   103       2

data.table

library(data.table)
DT <- as.data.table(dat)
DT[, .SD[ .N == 1L | frank(value) > 1, ], by = .(code1) ]
#     code1  code2 value
#    <char> <char> <num>
# 1:    111    101     5
# 2:    111    102     5
# 3:    666    102     3
# 4:    777    103     2

Data

dat <- data.frame(code1=c('111','111','111','666','666','777','777'), code2=c('100','101','102','102','103','103','104'), value=c(3,5,5,3,2,2,1))

Upvotes: 1

ThomasIsCoding

Reputation: 102625

A simple base R option with subset + ave

subset(
  data,
  ave(VALUE,CODE1,FUN = min)!=VALUE
)

gives

  CODE1 CODE2 VALUE
2   111   101     5
3   111   102     5
4   666   102     3
6   777   103     2

A data.table option

> setDT(data)[,.SD[VALUE!=min(VALUE)],CODE1]
   CODE1 CODE2 VALUE
1:   111   101     5
2:   111   102     5
3:   666   102     3
4:   777   103     2

A dplyr option

> data %>%
+   group_by(CODE1) %>%
+   filter(min(VALUE)!=VALUE)
# A tibble: 4 x 3
# Groups:   CODE1 [3]
  CODE1 CODE2 VALUE
  <dbl> <dbl> <dbl>
1   111   101     5
2   111   102     5
3   666   102     3
4   777   103     2

Upvotes: 1

Removing duplicate rows with condition about other column in R

Answers (2)

base R

dplyr

data.table

Related Questions