mat

Reputation: 2617

For each row check if value in one column exists in two other columns

Assume we have the following data frame:

df <- data.frame(X1 = 1:5, X2 = 6:10, X3 = c(6, 2, 3, 0, 2))

  X1 X2 X3
1  1  6  6
2  2  7  2
3  3  8  3
4  4  9  0
5  5 10  2

I would like to add a new column (X4) made of logical values. For each row: if X3 is equal to X1 or X2, then X4 should be TRUE, otherwise FALSE.

I tried:

mutate(df, X4 = X3 %in% c(X2, X1))

  X1 X2 X3    X4
1  1  6  6  TRUE # OK
2  2  7  2  TRUE # OK
3  3  8  3  TRUE # OK
4  4  9  0 FALSE # OK
5  5 10  2  TRUE # expected to be FALSE
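I suspect the culprit is that `c(X2, X1)` flattens both columns into a single vector of ten values, so `%in%` checks each `X3` value against every value of `X1` and `X2`, not just the ones in its own row:

```r
df <- data.frame(X1 = 1:5, X2 = 6:10, X3 = c(6, 2, 3, 0, 2))
# %in% tests membership in the whole pooled vector, not row by row
with(df, X3 %in% c(X2, X1))
#> [1]  TRUE  TRUE  TRUE FALSE  TRUE   # row 5: its 2 matches X1's 2 from row 2
```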

Most importantly, my real df is very large, so I would like to avoid for-loops. I would prefer the shortest (least code) and fastest solution.

Upvotes: 2

Views: 3090

Answers (3)

akrun

Reputation: 887851

We can use Reduce

Reduce(`|`, lapply(df[1:2], `==`, df[,3]))
#[1]  TRUE  TRUE  TRUE FALSE FALSE
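For clarity, the `lapply` call compares `X3` against each of the other columns in turn, producing one logical vector per column, and `Reduce` then ORs them elementwise with `|`:

```r
df <- data.frame(X1 = 1:5, X2 = 6:10, X3 = c(6, 2, 3, 0, 2))
cmp <- lapply(df[1:2], `==`, df[, 3])  # list of X1 == X3 and X2 == X3
cmp$X1            # FALSE  TRUE  TRUE FALSE FALSE
cmp$X2            #  TRUE FALSE FALSE FALSE FALSE
Reduce(`|`, cmp)  # elementwise OR: TRUE TRUE TRUE FALSE FALSE
```

An advantage of this form is that it generalizes to any number of comparison columns, e.g. `df[1:10]`.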

Benchmarking

The comparison makes more sense on bigger data:

library(microbenchmark)
set.seed(24)
df <- data.frame(X1= sample(1:5, 1e6, replace=TRUE), X2 = sample(1:10, 1e6, replace=TRUE),
       X3 = sample(1:10, 1e6, replace=TRUE))

f1 <- function(df) apply(df[1:3], 1, FUN = anyDuplicated) > 0  # from 989's answer below
f2 <- function(df) Reduce(`|`, lapply(df[1:2], `==`, df[,3]))
f3 <- function(df) with(df, X3==X1 | X3==X2)
microbenchmark(f1(df), f2(df), f3(df))
#Unit: milliseconds
#   expr         min         lq       mean     median         uq      max neval

# f2(df)    8.191218   10.83333   23.28081   16.42744   22.26866  143.025   100
# f3(df)    8.154506   10.58878   19.17879   11.49179   22.41255  144.510   100

The apply approach is slower, as I thought, but Reduce is not much slower than the fully vectorized version.

Upvotes: 1

www

Reputation: 39174

A solution using dplyr's rowwise().

library(dplyr)

df %>%
  rowwise() %>%
  mutate(X4 = any(c(X1, X2) %in% X3)) %>%
  ungroup()

# # A tibble: 5 x 4
#      X1    X2    X3 X4   
#   <int> <int> <dbl> <lgl>
# 1     1     6  6.00 T    
# 2     2     7  2.00 T    
# 3     3     8  3.00 T    
# 4     4     9  0    F    
# 5     5    10  2.00 F  
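Note that `rowwise()` processes the data one row at a time, which can be slow on a large data frame; since `==` and `|` are vectorized, the same result can also be had with a plain `mutate()` (a sketch):

```r
library(dplyr)

df <- data.frame(X1 = 1:5, X2 = 6:10, X3 = c(6, 2, 3, 0, 2))
# vectorized comparison; no rowwise() needed
mutate(df, X4 = X3 == X1 | X3 == X2)   # X4: TRUE TRUE TRUE FALSE FALSE
```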

Upvotes: 1

989

Reputation: 12935

You could do this vectorized, which the benchmark below shows to be the fastest:

df$X4 <- with(df, X3==X1 | X3==X2)
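Applied to the example data this gives the expected column:

```r
df <- data.frame(X1 = 1:5, X2 = 6:10, X3 = c(6, 2, 3, 0, 2))
df$X4 <- with(df, X3 == X1 | X3 == X2)
df$X4
#> [1]  TRUE  TRUE  TRUE FALSE FALSE
```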

Benchmarking

library(microbenchmark)
df <- data.frame(X1 = 1:5, X2 = 6:10, X3 = c(6, 2, 3, 0, 2))

f1 <- function(df) apply(df[1:3], 1, FUN = anyDuplicated)>0
f2 <- function(df) Reduce(`|`, lapply(df[1:2], `==`, df[,3]))
f3 <- function(df) with(df, X3==X1 | X3==X2)

all(f1(df)==f2(df))
#[1] TRUE
all(f1(df)==f3(df))
#[1] TRUE

res <- microbenchmark(f1(df), f2(df), f3(df))

print(res, order="mean")

# Unit: microseconds
   # expr     min       lq      mean  median      uq     max neval
 # f3(df)  14.115  15.3980  17.57113  17.537  17.965  40.634   100
 # f2(df)  79.130  80.8405  86.41780  85.118  88.325 124.468   100
 # f1(df) 223.273 225.6255 235.95907 228.619 238.243 497.445   100

Upvotes: 2
