Reputation: 2617
Assume we have the following data frame
:
df <- data.frame(X1 = 1:5, X2 = 6:10, X3 = c(6, 2, 3, 0, 2))
X1 X2 X3
1 1 6 6
2 2 7 2
3 3 8 3
4 4 9 0
5 5 10 2
I would like to add a new column (X4
) made of logical values. For each row: if X3
is equal to X1
or X2
, then X4
should be TRUE
, otherwise FALSE
.
I tried:
mutate(df, X4 = X3 %in% c(X2, X1))
X1 X2 X3 X4
1 1 6 6 TRUE # OK
2 2 7 2 TRUE # OK
3 3 8 3 TRUE # OK
4 4 9 0 FALSE # OK
5 5 10 2 TRUE # expected to be FALSE
Most importantly, my real df
is very large, so I would like to avoid using for-loops
. I would privilege the shortest (less amount of code) and fastest solution.
Upvotes: 2
Views: 3090
Reputation: 887851
We can use Reduce
Reduce(`|`, lapply(df[1:2], `==`, df[,3]))
#[1] TRUE TRUE TRUE FALSE FALSE
On a bigger data makes more sense
library(microbenchmark)
set.seed(24)
df <- data.frame(X1= sample(1:5, 1e6, replace=TRUE), X2 = sample(1:10, 1e6, replace=TRUE),
X3 = sample(1:10, 1e6, replace=TRUE))
f2 <- function(df) Reduce(`|`, lapply(df[1:2], `==`, df[,3]))
f3 <- function(df) with(df, X3==X1 | X3==X2)
microbenchmark(f1(df), f2(df), f3(df))
#Unit: milliseconds
# expr min lq mean median uq max neval
# f2(df) 8.191218 10.83333 23.28081 16.42744 22.26866 143.025 100
# f3(df) 8.154506 10.58878 19.17879 11.49179 22.41255 144.510 100
The apply
is slower as I thought, but the Reduce
is not as slow..
Upvotes: 1
Reputation: 39174
A solution using dplyr.
library(dplyr)
df %>%
rowwise() %>%
mutate(X4 = any(c(X1, X2) %in% X3)) %>%
ungroup()
# # A tibble: 5 x 4
# X1 X2 X3 X4
# <int> <int> <dbl> <lgl>
# 1 1 6 6.00 T
# 2 2 7 2.00 T
# 3 3 8 3.00 T
# 4 4 9 0 F
# 5 5 10 2.00 F
Upvotes: 1
Reputation: 12935
You could do this vectorized which is provably the fastest:
df$X4 <- with(df, X3==X1 | X3==X2)
benchmarking
library(microbenchmark)
df <- data.frame(X1 = 1:5, X2 = 6:10, X3 = c(6, 2, 3, 0, 2))
f1 <- function(df) apply(df[1:3], 1, FUN = anyDuplicated)>0
f2 <- function(df) Reduce(`|`, lapply(df[1:2], `==`, df[,3]))
f3 <- function(df) with(df, X3==X1 | X3==X2)
all(f1(df)==f2(df))
#[1] TRUE
all(f1(df)==f3(df))
#[1] TRUE
res <- microbenchmark(f1(df), f2(df), f3(df))
print(res, order="mean")
# Unit: microseconds
# expr min lq mean median uq max neval
# f3(df) 14.115 15.3980 17.57113 17.537 17.965 40.634 100
# f2(df) 79.130 80.8405 86.41780 85.118 88.325 124.468 100
# f1(df) 223.273 225.6255 235.95907 228.619 238.243 497.445 100
Upvotes: 2