cdd

Reputation: 515

Identifying duplicates with different IDs

I have the following data:

set.seed(26312)
id <- rep(c(1, 2, 3, 4, 5), each = 9)   # five ids, nine rows each
wrc <- round(runif(36, 20, 100))        # data for ids 1-4
wrc <- c(wrc, wrc[10:18])               # id 5 reuses id 2's values verbatim
x <- rep(1:9, 5)
dat <- data.frame(id, wrc, x)

In this data set, ids 2 and 5 contain exactly the same data under different IDs. This can be verified by running:

dat[dat$id == 2, ]
dat[dat$id == 5, ]

I have a much larger data set, with 4321 IDs, and I want to remove these duplicates because even though they have different IDs, they really are duplicates.

At present I'm using a combination of really awful and extremely slow for() and while() loops. In English, the code subsets one id and then, inside a while loop, compares that id to every other id I have subsetted. When it finds a duplicate, meaning all the rows of data are identical, it throws away the first id of the pair. The resulting cleaned_data is exactly what I want; it is just unbearably slow to get there. One comparison pass takes roughly 1 minute, so with 4321 ids that's about 4321 minutes to run this awful loop. Can someone help?

library("dplyr")
id_check = 1:5
cleaned_data <- data.frame()
for(i in id_check){
  compare_tmp <- dat %>% filter(id == i) 
  compare_check <- compare_tmp %>% select(wrc, x)
  duplicate = FALSE
  if(i == length(id_check)){
    cleaned_data <- rbind(cleaned_data, compare_tmp)
    break
  } else {
    id_tmp = i + 1
  }
  while(duplicate == FALSE){
    check <- dat %>% filter(id == id_tmp) %>% select(wrc, x)
    if(nrow(check) == 0) break
    duplicate = identical(compare_check, check)
    id_tmp = id_tmp + 1
    if(id_tmp == (length(id_check) + 1)) {
      break
    }
  }
  if(duplicate == FALSE){
    cleaned_data <- rbind(cleaned_data, compare_tmp)
  } 
}
cleaned_data

This is in response to why duplicated() won't work. Below, ids 2 and 5 are not the same subject, because their data aren't always identical.

set.seed(26312)
id <- rep(c(1, 2, 3, 4, 5), each = 9)
wrc <- round(runif(36, 20, 100))
wrc <- c(wrc, wrc[c(1, 11:18)])   # id 5 now differs from id 2 in its first row
x <- rep(1:9, 5)
dat <- data.frame(id, wrc, x)
dat[dat$id == 2,]
dat[dat$id == 5,]

If I run dat[!duplicated(dat[2:3]), ], it still removes id 5's rows, when it shouldn't.
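
The comparison therefore has to happen per id, not per row. One way to make that concrete is to collapse each id's rows into a single fingerprint and compare those instead (a sketch of mine, not part of the original code; it assumes rows within each id come in a consistent order, as they do here with x = 1:9):

# Build one fingerprint string per id from its (wrc, x) rows; two ids are
# duplicates only if their entire fingerprints match
key <- tapply(paste(dat$wrc, dat$x), dat$id, paste, collapse = ";")
dup_ids <- names(key)[duplicated(key, fromLast = TRUE)]
dat[!dat$id %in% dup_ids, ]

With the modified data above, dup_ids is empty and nothing is dropped, which is the desired behaviour.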

Upvotes: 1

Views: 101

Answers (3)

Aur&#232;le
Aur&#232;le

Reputation: 12849

Maybe something along the lines of:

do.call(
  rbind,
  split(dat, dat$id)[!duplicated(lapply(split(dat[2:3], dat$id), `rownames<-`, NULL), fromLast = TRUE)]
)

This splits the data by id, resets the row names so that otherwise-identical chunks compare equal, flags duplicated chunks with duplicated() (fromLast = TRUE keeps the last occurrence, as requested), and rbinds the non-duplicates back together.
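
Broken into steps (the intermediate names are mine), the one-liner reads:

chunks <- split(dat, dat$id)                  # one data frame per id
keys <- lapply(split(dat[2:3], dat$id),       # the same chunks without the id column,
               `rownames<-`, NULL)            # row names reset so equal chunks compare identical
keep <- !duplicated(keys, fromLast = TRUE)    # fromLast = TRUE keeps the last copy of each duplicate
cleaned_data <- do.call(rbind, chunks[keep])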


Edit
Since time is of the essence here, I ran a benchmark of the solutions so far:

library(dplyr)   # needed for the tidyr entry below
library(tidyr)

set.seed(26312)
p <- 4321                          # number of ids, matching the real data
id <- rep(1:p, each = 9)
# draw only p/2 distinct 9-value profiles and sample them with replacement,
# so roughly half the ids duplicate some other id's data
dats <- replicate(p %/% 2, round(runif(9, 20, 100)), simplify = FALSE)
wrc <- unlist(sample(dats, p, replace = TRUE))
x <- rep(1:9, times = p)
dat <- data.frame(id, wrc, x)

microbenchmark::microbenchmark(
  base = {
    do.call(
      rbind,
      split(dat, dat$id)[!duplicated(lapply(split(dat[2:3], dat$id), `rownames<-`, NULL), fromLast = TRUE)]
    )
  },
  tidyr = {
    as_tibble(dat) %>% 
      nest(-id) %>%
      filter(!duplicated(data, fromLast = TRUE)) %>% 
      unnest()
  },
  reshape = {
    dat_wide = reshape2::dcast(dat, id ~ x, value.var = "wrc")
    dupes = dat_wide$id[duplicated(dat_wide[-1], fromLast = T)]
    no_dupes = dat[!dat$id %in% dupes, ]
  },
  times = 10L
)

# Unit: milliseconds
#     expr      min        lq       mean     median         uq        max neval cld
#     base 892.8239 980.36553 1090.87505 1096.12514 1187.98810 1232.47244    10   c
#    tidyr 944.8156 953.10558  977.71756  976.83703  990.58672 1033.27664    10  b 
#  reshape  49.9955  50.13347   52.20539   51.91833   53.91568   55.64506    10 a  

The reshape approach is an order of magnitude faster, presumably because duplicated() then runs row-wise on a plain data frame of atomic columns instead of comparing whole per-id data frames stored in a list.

Upvotes: 3

Aur&#232;le
Aur&#232;le

Reputation: 12849

With tidyr:

library(tidyr)
library(dplyr)

as_tibble(dat) %>% 
  nest(-id) %>%
  filter(!duplicated(data, fromLast = TRUE)) %>% 
  unnest()

# # A tibble: 36 x 3
#       id   wrc     x
#    <dbl> <dbl> <int>
#  1     1    53     1
#  2     1    44     2
#  3     1    70     3
#  4     1    31     4
#  5     1    67     5
#  6     1    50     6
#  7     1    70     7
#  8     1    40     8
#  9     1    52     9
# 10     3    95     1
# # ... with 26 more rows
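
A side note on API drift: nest(-id) and argument-less unnest() are pre-1.0 tidyr syntax. With tidyr >= 1.0 the same pipeline would be written roughly as follows (a sketch, same logic):

library(tidyr)
library(dplyr)

as_tibble(dat) %>%
  nest(data = -id) %>%                         # one list-column row per id
  filter(!duplicated(data, fromLast = TRUE)) %>%
  unnest(data)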

(Note: I'm not sure about the Stack Overflow policy on multiple answers, but this one is different enough to deserve a separate answer IMHO. If it's not, please say so and I'll edit my initial answer and delete this one.)

Upvotes: 2

Gregor Thomas

Reputation: 146224

If the column structure is consistent (every id has exactly one wrc value for each x), you could convert to wide format for duplicate detection:

# one row per id, one column per value of x
dat_wide = reshape2::dcast(dat, id ~ x, value.var = "wrc")
# row-wise duplicates in the wide table are ids with identical data;
# fromLast = TRUE keeps the last occurrence
dupes = dat_wide$id[duplicated(dat_wide[-1], fromLast = TRUE)]

no_dupes = dat[!dat$id %in% dupes, ]
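
With tidyr >= 1.0, pivot_wider() gives the same wide table without the reshape2 dependency (a sketch of mine, not from the original answer):

library(tidyr)
dat_wide <- pivot_wider(dat, id_cols = id, names_from = x, values_from = wrc)
dupes <- dat_wide$id[duplicated(dat_wide[-1], fromLast = TRUE)]
no_dupes <- dat[!dat$id %in% dupes, ]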

Upvotes: 3
