Reputation: 515
I have the following data:
set.seed(26312)
id <- rep(c(1, 2, 3, 4, 5), each = 9)
wrc <- round(runif(36, 20, 100))
wrc <- c(wrc, wrc[10:18])
x <- rep(1:9, 5)
dat <- data.frame(id, wrc, x)
In this data set, id 2 and id 5 contain the exact same data but with different IDs. This can be verified by running:
dat[dat$id == 2, ]
dat[dat$id == 5, ]
I have a much larger data set, with 4321 IDs, and I want to remove these duplicates because even though they have different IDs, they really are duplicates.
Presently I am using a combination of really awful and extremely slow for() and while() loops. In English, the code subsets one id and then, inside a while loop, compares it to every other id I have subsetted. When it finds a duplicate, meaning all the rows of data are identical, it throws away the first id of the duplicated pair. The resulting cleaned_data is what I want; it is just unbearably slow to get there. Each comparison takes roughly 1 minute, so with 4321 ids the loop takes about 4321 minutes to run. Can someone help?
library("dplyr")

id_check = 1:5
cleaned_data <- data.frame()

for (i in id_check) {
  compare_tmp <- dat %>% filter(id == i)
  compare_check <- compare_tmp %>% select(wrc, x)
  duplicate = FALSE
  if (i == length(id_check)) {
    cleaned_data <- rbind(cleaned_data, compare_tmp)
    break
  } else {
    id_tmp = i + 1
  }
  while (duplicate == FALSE) {
    check <- dat %>% filter(id == id_tmp) %>% select(wrc, x)
    if (nrow(check) == 0) break
    duplicate = identical(compare_check, check)
    id_tmp = id_tmp + 1
    if (id_tmp == (length(id_check) + 1)) {
      break
    }
  }
  if (duplicate == FALSE) {
    cleaned_data <- rbind(cleaned_data, compare_tmp)
  }
}
cleaned_data
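For reference, here is a sketch of my own (not part of the question) showing how the quadratic pairwise search could be avoided: collapse each id's (wrc, x) rows into a single signature string and call duplicated() once on those signatures, keeping the last occurrence of each duplicated group as the loop above does.

```r
# Question's first example data, where ids 2 and 5 carry identical rows
set.seed(26312)
id <- rep(c(1, 2, 3, 4, 5), each = 9)
wrc <- round(runif(36, 20, 100))
wrc <- c(wrc, wrc[10:18])
dat <- data.frame(id, wrc, x = rep(1:9, 5))

# One signature string per id; split/tapply preserve row order within each id
sig <- tapply(paste(dat$wrc, dat$x, sep = ","), dat$id, paste, collapse = ";")

# fromLast = TRUE marks the earlier member of each duplicated pair,
# so the first duplicated id is the one thrown away
dupe_ids <- names(sig)[duplicated(sig, fromLast = TRUE)]
cleaned_data <- dat[!dat$id %in% dupe_ids, ]  # drops id 2, keeps id 5
```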
This is in response to why duplicated won't work. Below, ids 2 and 5 are not the same subjects because their data aren't always identical.
set.seed(26312)
id <- rep(c(1, 2, 3, 4, 5), each = 9)
wrc <- round(runif(36, 20, 100))
wrc <- c(wrc, wrc[c(1, 11:18)])
x <- rep(1:9, 5)
dat <- data.frame(id, wrc, x)
dat[dat$id == 2,]
dat[dat$id == 5,]
If I run dat[!duplicated(dat[2:3]), ], it removes id 5's rows, when it shouldn't.
Upvotes: 1
Views: 101
Reputation: 12849
Maybe something along the lines of:
do.call(
rbind,
split(dat, dat$id)[!duplicated(lapply(split(dat[2:3], dat$id), `rownames<-`, NULL), fromLast = TRUE)]
)
This splits the data by id, identifies the duplicated groups, then binds the non-duplicated groups back together.
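The `rownames<-` reset is the subtle part: split() keeps the original row names, so two blocks with equal values still compare as different data frames until the names are dropped. A small illustration of my own on the question's first example data (not part of the answer):

```r
set.seed(26312)
id <- rep(c(1, 2, 3, 4, 5), each = 9)
wrc <- round(runif(36, 20, 100))
wrc <- c(wrc, wrc[10:18])
dat <- data.frame(id, wrc, x = rep(1:9, 5))

blocks <- split(dat[2:3], dat$id)

# FALSE: the values match, but row names 10:18 vs 37:45 differ
identical(blocks[["2"]], blocks[["5"]])

# TRUE once row names are reset, which is what `rownames<-`(..., NULL) does
identical(`rownames<-`(blocks[["2"]], NULL),
          `rownames<-`(blocks[["5"]], NULL))
```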
Edit
Since time is of the essence here, I ran a benchmark of the solutions so far:
set.seed(26312)
p <- 4321
id <- rep(1:p, each = 9)
dats <- replicate(p %/% 2, round(runif(9, 20, 100)), simplify = FALSE)
wrc <- unlist(sample(dats, p, replace = TRUE))
x <- rep(1:9, times = p)
dat <- data.frame(id, wrc, x)
microbenchmark::microbenchmark(
base = {
do.call(
rbind,
split(dat, dat$id)[!duplicated(lapply(split(dat[2:3], dat$id), `rownames<-`, NULL), fromLast = TRUE)]
)
},
tidyr = {
as_tibble(dat) %>%
nest(-id) %>%
filter(!duplicated(data, fromLast = TRUE)) %>%
unnest()
},
reshape = {
dat_wide = reshape2::dcast(dat, id ~ x, value.var = "wrc")
dupes = dat_wide$id[duplicated(dat_wide[-1], fromLast = T)]
no_dupes = dat[!dat$id %in% dupes, ]
},
times = 10L
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# base 892.8239 980.36553 1090.87505 1096.12514 1187.98810 1232.47244 10 c
# tidyr 944.8156 953.10558 977.71756 976.83703 990.58672 1033.27664 10 b
# reshape 49.9955 50.13347 52.20539 51.91833 53.91568 55.64506 10 a
Upvotes: 3
Reputation: 12849
With tidyr
:
library(tidyr)
library(dplyr)
as_tibble(dat) %>%
nest(-id) %>%
filter(!duplicated(data, fromLast = TRUE)) %>%
unnest()
# # A tibble: 36 x 3
# id wrc x
# <dbl> <dbl> <int>
# 1 1 53 1
# 2 1 44 2
# 3 1 70 3
# 4 1 31 4
# 5 1 67 5
# 6 1 50 6
# 7 1 70 7
# 8 1 40 8
# 9 1 52 9
# 10 3 95 1
# # ... with 26 more rows
(Note: I'm not sure about the Stack Overflow policy on multiple answers, but this one is different enough to deserve a separate answer IMHO. If it's not, please say so and I'll edit my initial answer and delete this one.)
Upvotes: 2
Reputation: 146224
If the column structure is accurate, you could convert to wide format for duplicate detection:
dat_wide = reshape2::dcast(dat, id ~ x, value.var = "wrc")
dupes = dat_wide$id[duplicated(dat_wide[-1], fromLast = T)]
no_dupes = dat[!dat$id %in% dupes, ]
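When every id really does have the same nine x values in the same order (the "column structure is accurate" caveat), the same wide comparison can be done without reshape2 by folding wrc into a matrix with one row per id. This is my own base-R sketch, not part of the answer:

```r
set.seed(26312)
id <- rep(c(1, 2, 3, 4, 5), each = 9)
wrc <- round(runif(36, 20, 100))
wrc <- c(wrc, wrc[10:18])
dat <- data.frame(id, wrc, x = rep(1:9, 5))

# One row per id, one column per x; valid only because every id has
# the same nine x values in the same order
wide <- matrix(dat$wrc, ncol = 9, byrow = TRUE)
rownames(wide) <- unique(dat$id)

# duplicated() on a matrix compares whole rows, i.e. whole subjects
dupes <- rownames(wide)[duplicated(wide, fromLast = TRUE)]
no_dupes <- dat[!dat$id %in% dupes, ]
```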
Upvotes: 3