Reputation: 6599
The Problem
I have two string vectors of different lengths. Each vector has a different set of strings. I want to find the strings that are in one vector but not in both; that is, the symmetric difference.
Analysis
I looked at the function setdiff, but its output depends on the order in which the vectors are considered. I found the custom function outersect, but this function requires the two vectors to be of the same length.
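To make the order dependence concrete, here is a minimal sketch with two made-up character vectors of different lengths (the names x and y and their contents are illustrative only, not my real data):

```r
# Hypothetical example vectors of different lengths
x <- c("apple", "banana", "cherry", "date")
y <- c("banana", "cherry", "elderberry")

# setdiff() is one-sided: it keeps elements of its first argument
# that do not appear in its second argument
only_in_x <- setdiff(x, y)
only_in_y <- setdiff(y, x)

only_in_x  # "apple" "date"
only_in_y  # "elderberry"
```

Neither call alone gives the symmetric difference; what I am after is the combination of both results.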
Any suggestions?
Correction
This issue seems to be specific to the data with which I am working. Otherwise, the answer below addresses the problem I mention in this post. I will look to see what is unique about my data and post back if I learn anything that might be helpful to other users.
Upvotes: 12
Views: 11174
Reputation: 33
This is an old question, but if you want a faster function, avoid set-operation functions like setdiff or union: they call duplicated or unique internally, so you end up removing duplicates repeatedly. Using match and removing duplicates once at the end looks to be the fastest. For character vectors, data.table::chmatch is faster than match.
library(data.table)
x1 <- janeaustenr::austen_books()$text |> sample(3e3)
x2 <- janeaustenr::austen_books()$text |> sample(3e3)
symdiff_dt <- function(x, y) {
  c(
    x[chmatch(x, y, 0L) == 0L],
    y[chmatch(y, x, 0L) == 0L]
  ) |>
    unique()
}
symdiff_match <- function(x, y) {
  c(x[!x %in% y], y[!y %in% x]) |> unique()
}
symdiff_setdiff1 <- function(x, y) {
  c(
    setdiff(x, y),
    setdiff(y, x)
  ) |>
    unique()
}
symdiff_setdiff2 <- function(x, y) {
  setdiff(
    union(x, y),
    intersect(x, y)
  )
}
microbenchmark::microbenchmark(
  symdiff_dt = symdiff_dt(x1, x2),
  symdiff_match = symdiff_match(x1, x2),
  symdiff_setdiff1 = symdiff_setdiff1(x1, x2),
  symdiff_setdiff2 = symdiff_setdiff2(x1, x2),
  check = "equal"
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> symdiff_dt 327.5 386.90 489.628 409.2 462.70 2876.3 100
#> symdiff_match 405.6 519.25 809.597 555.3 646.95 12428.0 100
#> symdiff_setdiff1 532.6 662.00 954.322 718.0 817.00 10741.9 100
#> symdiff_setdiff2 675.2 767.40 1040.671 823.0 946.00 10056.3 100
Created on 2024-01-04 with reprex v2.0.2
Upvotes: 0
Reputation: 52319
You can use symdiff from dplyr (available since version 1.1.0):
library(dplyr)
symdiff(1:3, 3:5)
#[1] 1 2 4 5
Upvotes: 2
Reputation: 76641
Here is another symmetric difference function, this one from the definition (that can be seen, for instance, in the Wikipedia page linked to in the question).
sym_diff3 <- function(a, b) union(setdiff(a, b), setdiff(b, a))
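As a small illustration of how this definition behaves when the inputs contain repeated values, here is a sketch with made-up vectors (a and b are hypothetical). Because setdiff and union both drop duplicates, each string appears at most once in the result:

```r
sym_diff3 <- function(a, b) union(setdiff(a, b), setdiff(b, a))

# Hypothetical inputs with repeated values
a <- c("a", "a", "b")
b <- c("b", "c", "c")

res <- sym_diff3(a, b)
res  # "a" "c"
```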
Including the function in the test run from this other answer by user sebpardo gives approximately the same timings, with sym_diff3 a little slower. Output omitted.
identical(sym_diff(cars1, cars2), sym_diff3(cars1, cars2))
#[1] TRUE
microbenchmark(sym_diff(cars1, cars2),
               sym_diff2(cars1, cars2),
               sym_diff3(cars1, cars2),
               times = 10000L)
Upvotes: 3
Reputation: 707
Another option that is a bit faster is:
sym_diff2 <- function(a,b) unique(c(setdiff(a,b), setdiff(b,a)))
If we compare it with the answer by Blue Magister:
sym_diff <- function(a,b) setdiff(union(a,b), intersect(a,b))
library(microbenchmark)
library(MASS)
set.seed(1)
cars1 <- sample(Cars93$Make, 70)
cars2 <- sample(Cars93$Make, 70)
microbenchmark(sym_diff(cars1, cars2), sym_diff2(cars1, cars2), times = 10000L)
>Unit: microseconds
> expr min lq mean median uq max neval
>sym_diff(cars1, cars2) 114.719 119.7785 150.7510 125.0410 131.177 12382.02 10000
>sym_diff2(cars1, cars2) 94.369 100.0205 121.6051 103.8285 109.239 12013.69 10000
identical(sym_diff(cars1, cars2), sym_diff2(cars1, cars2))
>[1] TRUE
The speed difference between these two methods increases when the samples compared are larger (thousands of elements or more), but I couldn't find an example dataset with that many values.
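One way to try larger inputs without a real dataset is to generate synthetic strings. The following is only a sketch: the pool of labels, the sizes, and the names pool, big1 and big2 are assumptions chosen for illustration, and timings will vary by machine, so none are shown.

```r
sym_diff  <- function(a, b) setdiff(union(a, b), intersect(a, b))
sym_diff2 <- function(a, b) unique(c(setdiff(a, b), setdiff(b, a)))

set.seed(1)
# Hypothetical pool of 5000 distinct labels; draw two overlapping samples
pool <- sprintf("item_%05d", 1:5000)
big1 <- sample(pool, 3000)
big2 <- sample(pool, 3000)

# Both functions should return the same result on the larger inputs
res1 <- sym_diff(big1, big2)
res2 <- sym_diff2(big1, big2)
identical(res1, res2)

# Rough base-R timing over repeated calls (microbenchmark could be used instead)
system.time(for (i in 1:200) sym_diff(big1, big2))
system.time(for (i in 1:200) sym_diff2(big1, big2))
```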
Upvotes: 10
Reputation: 13363
Why not:
sym_diff <- function(a,b) setdiff(union(a,b), intersect(a,b))
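A quick usage sketch, with made-up string vectors of different lengths (x and y are hypothetical), suggests that neither length nor argument order changes which elements come back; only their order in the result differs:

```r
sym_diff <- function(a, b) setdiff(union(a, b), intersect(a, b))

# Hypothetical example vectors of different lengths
x <- c("apple", "banana", "cherry", "date")
y <- c("banana", "cherry", "elderberry")

xy <- sym_diff(x, y)
yx <- sym_diff(y, x)

xy  # "apple" "date" "elderberry"
yx  # "elderberry" "apple" "date"

# Same set of strings either way
setequal(xy, yx)  # TRUE
```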
Upvotes: 23