Reputation: 491
I want to remove all values that are duplicated across multiple vectors, keeping none of their occurrences. For example, for these vectors:
a <- c("dog", "fish", "cow")
b <- c("dog", "horse", "mouse")
c <- c("cat", "sheep", "mouse")
the expected result would be:
a <- c("fish", "cow")
b <- c("horse")
c <- c("cat", "sheep")
Is there a way to achieve this without concatenating the vectors and splitting them again?
Upvotes: 12
Views: 1215
Reputation: 52209
Yet another possibility is collapse::fduplicated(x, all = TRUE). Unlike base R's duplicated(), it flags every occurrence of a value that appears more than once, not only the later ones:
lst <- list(a = a, b = b, c = c)
unstack(subset(stack(lst), !collapse::fduplicated(values, all = TRUE)))
# $a
# [1] "fish" "cow"
#
# $b
# [1] "horse"
#
# $c
# [1] "cat" "sheep"
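For intuition, fduplicated(values, all = TRUE) flags the same rows as combining duplicated() from both directions in base R; a quick check on the example data (not part of the original answer, and expected to return TRUE):
v <- stack(lst)$values
identical(collapse::fduplicated(v, all = TRUE),
          duplicated(v) | duplicated(v, fromLast = TRUE))
# [1] TRUE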
Benchmark on a list of 100 elements of length 10: my answer using collapse is the fastest (relative times shown), with @Friede's base R answer essentially tied.
  expression      min   median  itr/sec  mem_alloc  n_itr
1 tmfmnk         5.51     5.95   445.47      44.72     10
2 Tic1           2.87     3.06   879.08       2.12     10
3 Tic2          27.05    26.28    98.60      59.35     10
4 Tic3           4.43     4.28   504.72       2.78     10
5 jay.sf      2931.20  2785.01     1.00    5925.16     10
6 Edward        28.03    27.67    98.86      56.71     10
7 Maël           1.00     1.00  2699.28       1.00     10
8 Friede         1.03     1.00  2568.27       1.25     10
code:
lst <- lapply(
  setNames(
    as.list(replicate(100,
                      sample(combn(letters, m = 2, paste, collapse = ""),
                             size = 10, replace = TRUE),
                      simplify = FALSE)),
    paste0('A', 1:100)
  ),
  c
)
vec <- unlist(lst, use.names = FALSE)
bench::mark(
tmfmnk = sapply(lst, function(x) x[!x %in% vec[duplicated(vec)]]),
Tic1 = unstack(subset(stack(lst), ave(seq_along(values), values, FUN = length) == 1)),
Tic2 = lapply(seq_along(lst), \(k) setdiff(lst[[k]], unlist(lst[-k]))),
Tic3 = {v <- names(which(table(unlist(lst)) == 1))
lapply(lst, intersect, v)},
jay.sf = outer(seq_along(lst), seq_along(lst), Vectorize(\(i, j) setdiff(lst[[i]], unlist(lst[-j])))) |>
diag(),
Edward = lapply(seq_along(lst), \(x) lst[[x]][!lst[[x]] %in% unlist(lst[setdiff(seq_along(lst)[-x], x)])]),
Maël = unstack(subset(stack(lst), !collapse::fduplicated(values, all = TRUE))),
Friede = unstack(subset(stack(lst), !duplicated(values) & !duplicated(values, fromLast=TRUE))),
check = FALSE,
iterations = 10,
relative = TRUE
)
Returning a data.frame: if you are okay with returning a data.frame instead of a list, then this gets much faster. With collapse, you can do:
library(collapse)
lst <- list(a = a, b = b, c = c)
dat <- qDF(pivot(lst))
dat[!fduplicated(dat$value, all = TRUE), ]
# variable value
# 2 a fish
# 3 a cow
# 5 b horse
# 7 c cat
# 8 c sheep
On my computer, collapse is 3 times faster than the data.table option DT[!duplicated(value) & !duplicated(value, fromLast = TRUE)].
Upvotes: 9
Reputation: 132864
If the concept of "duplicated" applies, these vectors are actually one dataset. You should just put them into one data structure and create "tidy data". I suggest using package data.table, especially if your dataset is large:
library(data.table)
DT <- data.table(a, b, c)
DT <- melt(DT, measure.vars = 1:3)
Then you can easily remove duplicated values.
DT[!duplicated(value) & !duplicated(value, fromLast = TRUE)]
# variable value
# <fctr> <char>
#1: a fish
#2: a cow
#3: b horse
#4: c cat
#5: c sheep
This approach assumes that your dataset isn't so large that the memory demand for the variable column is an issue.
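If you still need separate vectors afterwards, one possible follow-up is to split the value column by variable (a sketch; res is just a placeholder name for the filtered table):
res <- DT[!duplicated(value) & !duplicated(value, fromLast = TRUE)]
split(res$value, res$variable)
# $a
# [1] "fish" "cow"
#
# $b
# [1] "horse"
#
# $c
# [1] "cat" "sheep"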
Upvotes: 3
Reputation: 19134
lst <- list(a,b,c)
lapply(seq_along(lst), \(x) lst[[x]][!lst[[x]] %in% unlist(lst[setdiff(seq_along(lst)[-x], x)])])
[[1]]
[1] "fish" "cow"
[[2]]
[1] "horse"
[[3]]
[1] "cat" "sheep"
This solution keeps duplicates within the same vector and only removes values that are duplicated across multiple vectors, as stated in the question. E.g., applying the function to
a <- c("dog", "fish", "dog")
b <- c("cow", "horse", "mouse")
c <- c("cat", "sheep", "mouse")
lst <- list(a,b,c); lst
gives
[[1]]
[1] "dog" "fish" "dog"
[[2]]
[1] "cow" "horse"
[[3]]
[1] "cat" "sheep"
while other answers give
[[1]]
[1] "fish"
[[2]]
[1] "cow" "horse"
[[3]]
[1] "cat" "sheep"
Upvotes: 6
Reputation: 7827
Coming late to the answer party.
Base R, doing !duplicated() twice.
unstack(subset(stack(l), !duplicated(values) & !duplicated(values, fromLast=TRUE)))
$a
[1] "fish" "cow"
$b
[1] "horse"
$c
[1] "cat" "sheep"
This avoids *apply functions, Vectorize() (which is mapply() under the hood), and outer().
Data
l = list(a = c("dog", "fish", "cow"), b = c("dog", "horse", "mouse"), c = c("cat", "sheep", "mouse"))
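To see what the two duplicated() calls do, it helps to inspect the stacked intermediate (an illustration on the example data, not part of the original answer):
s <- stack(l)
s$values
# [1] "dog"   "fish"  "cow"   "dog"   "horse" "mouse" "cat"   "sheep" "mouse"
!duplicated(s$values) & !duplicated(s$values, fromLast = TRUE)
# [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE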
Upvotes: 10
Reputation: 40051
You could perhaps do:
vec <- c(a, b, c)
sapply(list(a, b, c), function(x) x[!x %in% vec[duplicated(vec)]])
[[1]]
[1] "fish" "cow"
[[2]]
[1] "horse"
[[3]]
[1] "cat" "sheep"
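The key step is vec[duplicated(vec)]: every value that occurs more than once shows up in it at least once, which is all the %in% filter needs. On the example data:
vec[duplicated(vec)]
[1] "dog"   "mouse"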
If you need individual variables in the global environment, with the addition of lst() from tibble:
vec <- c(a, b, c)
l <- sapply(lst(a, b, c), function(x) x[!x %in% vec[duplicated(vec)]])
list2env(l, envir = .GlobalEnv)
Upvotes: 12
Reputation: 73262
Using setdiff in outer; diag() then extracts the entries where i == j, which give the result.
> lst <- list(a, b, c)
> outer(seq_along(lst), seq_along(lst),
+ Vectorize(\(i, j) setdiff(lst[[i]], unlist(lst[-j])))) |>
+ diag()
[[1]]
[1] "fish" "cow"
[[2]]
[1] "horse"
[[3]]
[1] "cat" "sheep"
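Each diagonal entry [i, i] is setdiff(lst[[i]], unlist(lst[-i])), so the outer() result can be sanity-checked against a plain lapply; a quick verification not part of the original answer (res_outer is a name introduced here, and the comparison should return TRUE):
> res_outer <- outer(seq_along(lst), seq_along(lst),
+     Vectorize(\(i, j) setdiff(lst[[i]], unlist(lst[-j])))) |> diag()
> identical(res_outer, lapply(seq_along(lst), \(i) setdiff(lst[[i]], unlist(lst[-i]))))
[1] TRUE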
Upvotes: 6
Reputation: 102241
Given data in a list, e.g., lst <- list(a = a, b = b, c = c), you can try
> unstack(subset(stack(lst), ave(seq_along(values), values, FUN = length) == 1))
$a
[1] "fish" "cow"
$b
[1] "horse"
$c
[1] "cat" "sheep"
> lapply(seq_along(lst), \(k) setdiff(lst[[k]], unlist(lst[-k])))
[[1]]
[1] "fish" "cow"
[[2]]
[1] "horse"
[[3]]
[1] "cat" "sheep"
> v <- names(which(table(unlist(lst)) == 1))
> lapply(lst, intersect, v)
$a
[1] "fish" "cow"
$b
[1] "horse"
$c
[1] "cat" "sheep"
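In the first variant, ave(seq_along(values), values, FUN = length) computes, for each row of the stacked data, how many times its value occurs in total, and only rows with a count of 1 are kept. An illustration on the example data (not part of the original answer):
> s <- stack(lst)
> ave(seq_along(s$values), s$values, FUN = length)
[1] 2 1 1 2 1 2 1 1 2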
Upvotes: 9