Ben

Reputation: 491

Remove duplicates across multiple vectors

I want to remove every value that is duplicated across multiple vectors, keeping no copy of it. For example, for these vectors:

a <- c("dog", "fish", "cow")
b <- c("dog", "horse", "mouse")
c <- c("cat", "sheep", "mouse")

the expected result would be:

a <- c("fish", "cow")
b <- c("horse")
c <- c("cat", "sheep")

Is there a way to achieve this without concatenating the vectors and splitting them again?

Upvotes: 12

Views: 1215

Answers (7)

Maël

Reputation: 52209

Yet another possibility is collapse::fduplicated(x, all = TRUE). Unlike base R's duplicated(), which skips the first occurrence, this function flags every occurrence of a value that appears more than once:

lst <- list(a = a, b = b, c = c)
unstack(subset(stack(lst), !collapse::fduplicated(values, all = TRUE)))

# $a
# [1] "fish" "cow" 
# 
# $b
# [1] "horse"
# 
# $c
# [1] "cat"   "sheep"
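
For readers without collapse installed, the effect of all = TRUE can be sketched in base R by combining a forward and a backward duplicated() scan. This is only an illustration on the question's data; the names x, first_pass and every_dup are made up for the example.

```r
# Values of a, b and c from the question, concatenated
x <- c("dog", "fish", "cow", "dog", "horse", "mouse", "cat", "sheep", "mouse")

# duplicated() flags only the second and later occurrences
first_pass <- duplicated(x)

# OR-ing a forward and a backward scan flags every occurrence of a repeated value
every_dup <- duplicated(x) | duplicated(x, fromLast = TRUE)

x[first_pass]   # "dog" "mouse"
x[!every_dup]   # "fish" "cow" "horse" "cat" "sheep"
```

Negating every_dup is what leaves no copy of "dog" or "mouse" behind.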

Benchmark on a list of 100 vectors of length 10 (timings are relative to the fastest): my answer using collapse is the fastest, and @Friede's base R answer is practically as fast.

  expression     min  median itr/sec mem_alloc n_itr
1     tmfmnk    5.51    5.95  445.47     44.72    10
2       Tic1    2.87    3.06  879.08      2.12    10
3       Tic2   27.05   26.28   98.60     59.35    10
4       Tic3    4.43    4.28  504.72      2.78    10
5     jay.sf 2931.20 2785.01    1.00   5925.16    10
6     Edward   28.03   27.67   98.86     56.71    10
7       Maël    1.00    1.00 2699.28      1.00    10
8     Friede    1.03    1.00 2568.27      1.25    10

code:

lst <- lapply(setNames(as.list(replicate(100, sample(combn(letters, m = 2, paste, collapse = ""), size = 10, replace = TRUE), simplify = FALSE)), paste0('A', 1:100)), c)
vec <- unlist(lst, use.names = FALSE)

bench::mark(
  tmfmnk = sapply(lst, function(x) x[!x %in% vec[duplicated(vec)]]),
  Tic1 = unstack(subset(stack(lst), ave(seq_along(values), values, FUN = length) == 1)),
  Tic2 = lapply(seq_along(lst), \(k) setdiff(lst[[k]], unlist(lst[-k]))),
  Tic3 = {v <- names(which(table(unlist(lst)) == 1))
  lapply(lst, intersect, v)},
  jay.sf = outer(seq_along(lst), seq_along(lst), Vectorize(\(i, j) setdiff(lst[[i]], unlist(lst[-j])))) |>
    diag(),
  Edward = lapply(seq_along(lst), \(x) lst[[x]][!lst[[x]] %in% unlist(lst[setdiff(seq_along(lst)[-x], x)])]),
  Maël = unstack(subset(stack(lst), !collapse::fduplicated(values, all = TRUE))),
  Friede = unstack(subset(stack(lst), !duplicated(values) & !duplicated(values, fromLast=TRUE))),
  check = FALSE,
  iterations = 10,
  relative = TRUE
)

Returning a data.frame: if a data.frame is acceptable instead of a list, this gets much faster. With collapse, you can do:

library(collapse)
lst <- list(a = a, b = b, c = c)
dat <- qDF(pivot(lst))

dat[!fduplicated(dat$value, all = TRUE), ]

#   variable value
# 2        a  fish
# 3        a   cow
# 5        b horse
# 7        c   cat
# 8        c sheep

On my computer, collapse is 3 times faster than the data.table option DT[!duplicated(value) & !duplicated(value, fromLast = TRUE)].

Upvotes: 9

Roland

Reputation: 132864

If the concept of "duplicated" applies, these vectors are actually one dataset. You should put them into one data structure and create "tidy data". I suggest using the data.table package, especially if your dataset is large:

library(data.table)
DT <- data.table(a, b, c)
DT <- melt(DT, measure.vars = 1:3)

Then you can easily remove duplicated values.

DT[!duplicated(value) & !duplicated(value, fromLast = TRUE)]
#   variable  value
#     <fctr> <char>
#1:        a   fish
#2:        a    cow
#3:        b  horse
#4:        c    cat
#5:        c  sheep

This approach assumes that your dataset isn't so large that the memory demand for the variable column is an issue.
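
data.table(a, b, c) builds one column per vector, which fits the example because all three vectors have the same length. For vectors of different lengths, a possible variant is to build the long table directly with rbindlist(); the input below is hypothetical and the sketch assumes a named list.

```r
library(data.table)

# Hypothetical input: vectors of different lengths
lst <- list(a = c("dog", "fish", "cow"),
            b = c("dog", "horse"),
            c = c("cat", "sheep", "mouse", "horse"))

# One small table per vector, stacked; idcol records the source vector's name
DT <- rbindlist(lapply(lst, function(x) data.table(value = x)), idcol = "variable")

# Same filter as above: drop every occurrence of a repeated value
res <- DT[!duplicated(value) & !duplicated(value, fromLast = TRUE)]
res
```

Here "dog" and "horse" occur twice, so only fish, cow, cat, sheep and mouse should remain.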

Upvotes: 3

Edward

Reputation: 19134

lst <- list(a,b,c)

lapply(seq_along(lst), \(x) lst[[x]][!lst[[x]] %in% unlist(lst[setdiff(seq_along(lst)[-x], x)])])

[[1]]
[1] "fish" "cow" 

[[2]]
[1] "horse"

[[3]]
[1] "cat"   "sheep"

This solution keeps duplicates within the same vector and only removes values that are duplicated across multiple vectors, as stated in the question. E.g., applying the function to

a <- c("dog", "fish", "dog")
b <- c("cow", "horse", "mouse")
c <- c("cat", "sheep", "mouse")

lst <- list(a,b,c); lst

gives

[[1]]
[1] "dog"  "fish" "dog" 

[[2]]
[1] "cow"   "horse"

[[3]]
[1] "cat"   "sheep"

while other answers give

[[1]]
[1] "fish"

[[2]]
[1] "cow"   "horse"

[[3]]
[1] "cat"   "sheep"
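
The distinction can also be expressed by counting, for each value, how many different vectors contain it. The following is a rough sketch of that idea, not the code from this answer; vecs, n_vec and shared are names invented for the illustration.

```r
vecs <- list(a = c("dog", "fish", "dog"),
             b = c("cow", "horse", "mouse"),
             c = c("cat", "sheep", "mouse"))

# De-duplicate within each vector first, so each vector counts a value at most once
n_vec <- table(unlist(lapply(vecs, unique)))

# Values found in more than one vector
shared <- names(n_vec)[n_vec > 1]

# Drop only the shared values; repeats inside a single vector survive
res <- lapply(vecs, function(x) x[!x %in% shared])
res
```

With this input only "mouse" is shared, so both copies of "dog" in a are kept.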

Upvotes: 6

Friede

Reputation: 7827

Coming late to the answer party.

Base R, applying !duplicated() twice (once from each end).

unstack(subset(stack(l), !duplicated(values) & !duplicated(values, fromLast=TRUE)))
$a
[1] "fish" "cow" 

$b
[1] "horse"

$c
[1] "cat"   "sheep"

This avoids *apply functions, Vectorize() (which wraps mapply()), and outer().

Data

l = list(a = c("dog", "fish", "cow"), b = c("dog", "horse", "mouse"), c = c("cat", "sheep", "mouse"))

Upvotes: 10

tmfmnk

Reputation: 40051

You could perhaps do:

vec <- c(a, b, c)
sapply(list(a, b, c), function(x) x[!x %in% vec[duplicated(vec)]])

[[1]]
[1] "fish" "cow" 

[[2]]
[1] "horse"

[[3]]
[1] "cat"   "sheep"

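If you need the results as individual variables in the global environment, build a named list with lst() from tibble and pass the result to list2env():

library(tibble)  # provides lst(), which names elements after the variables

vec <- c(a, b, c)
l <- sapply(lst(a, b, c), function(x) x[!x %in% vec[duplicated(vec)]])
list2env(l, envir = .GlobalEnv)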
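
One caveat with sapply(): it simplifies its result whenever every element has the same length, so the output is not always a list. The small example below uses made-up two-element vectors and a helper called drop_dups (a name chosen for the example) to show the difference from lapply(), which always returns a list.

```r
a <- c("dog", "fish")
b <- c("dog", "horse")
vec <- c(a, b)

# Same filter as above, wrapped in a helper for reuse
drop_dups <- function(x) x[!x %in% vec[duplicated(vec)]]

simplified <- sapply(list(a, b), drop_dups)  # both results have length 1
kept_list  <- lapply(list(a, b), drop_dups)  # never simplified

is.list(simplified)  # FALSE
is.list(kept_list)   # TRUE
```

If a list is required downstream, for instance by list2env(), lapply() or sapply(..., simplify = FALSE) is the safer choice.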

Upvotes: 12

jay.sf

Reputation: 73262

Using setdiff() inside outer(); diag() then extracts the result.

> lst <- list(a, b, c)
> outer(seq_along(lst), seq_along(lst), 
+       Vectorize(\(i, j) setdiff(lst[[i]], unlist(lst[-j])))) |>
+   diag()
[[1]]
[1] "fish" "cow" 

[[2]]
[1] "horse"

[[3]]
[1] "cat"   "sheep"

Upvotes: 6

ThomasIsCoding

Reputation: 102241

Given data in a list, e.g., lst <- list(a = a, b = b, c = c), you can try

  • Option 1
> unstack(subset(stack(lst), ave(seq_along(values), values, FUN = length) == 1))
$a
[1] "fish" "cow"

$b
[1] "horse"

$c
[1] "cat"   "sheep"
  • Option 2
> lapply(seq_along(lst), \(k) setdiff(lst[[k]], unlist(lst[-k])))
[[1]]
[1] "fish" "cow"

[[2]]
[1] "horse"

[[3]]
[1] "cat"   "sheep"
  • Option 3
> v <- names(which(table(unlist(lst)) == 1))

> lapply(lst, intersect, v)
$a
[1] "fish" "cow"

$b
[1] "horse"

$c
[1] "cat"   "sheep"
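
As a note on Option 1: ave() with FUN = length returns, for each element, the size of the group it belongs to, so testing for == 1 keeps only values that occur exactly once. A small sketch on the question's values (the names values and n are chosen for the example):

```r
values <- c("dog", "fish", "cow", "dog", "horse", "mouse", "cat", "sheep", "mouse")

# For every position, how many times does that value occur overall?
n <- ave(seq_along(values), values, FUN = length)
n                # 2 1 1 2 1 2 1 1 2

values[n == 1]   # "fish" "cow" "horse" "cat" "sheep"
```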

Upvotes: 9
