Vesper

Reputation: 87

How to determine whether two sets of variables have a shared value in R?

I have a dataset that contains two groups of variables, and I want to check, row by row, whether the two groups share a value. In each row, if any pair of values (one from each group) is equal, the row is marked 1, otherwise 0. Missing values should not participate in the comparison. If the data contains a character variable, its values should still count as equal to a numeric variable's values when they represent the same number.

To illustrate the problem, I generated the data a1 below. I want to determine whether any values in the first group of variables (z1 and x1) match any in the second group (z2 and x2), and to generate the variable result:

library(data.table)
a1 = data.table(z1 = c(1, NA, 3:5), x1 = c("3", 4:7), z2 = c(2, NA, 4:6), x2 = c(3, 5, 4, 7, 5))
a1$result = c(1, 0, 0, 0, 1)  # the expected output
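
Row 1 illustrates the character/numeric rule: x1 holds the string "3" and x2 the number 3, and R's == coerces the numeric side to character before comparing, so the pair counts as equal:

"3" == 3
# [1] TRUE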

The actual data has close to 20 million rows, and there are many variables in each group, so I'm looking for the most efficient method. Thanks a lot!

Upvotes: 2

Views: 1046

Answers (4)

r2evans

Reputation: 160952

This is definitely a problem that scales poorly. After benchmarking several candidates (as.matrix, apply, asplit, data.table::transpose, etc.), I haven't found one that scales reasonably beyond 50K rows.

The most direct (and, to me, palatable, performance-wise) path is the most literal:

a1[, result := +(z1 == z2 | z1 == x2 | x1 == z2 | x1 == x2)]
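
For illustration, evaluating just the comparison on the example data shows the second row's NAs propagating through == and |:

a1[, +(z1 == z2 | z1 == x2 | x1 == z2 | x1 == x2)]
# [1]  1 NA  0  0  1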

So this fails with NA values, and we need to be a little more careful. After some playing, I think this helper function is the most direct, since it does exactly the logic we need and is fully vectorized:

`%=%` <- function(a, b) !is.na(a) & !is.na(b) & a == b
a1[, +(z1 %=% z2 | x1 %=% z2 | z1 %=% x2 | x1 %=% x2)]
# [1] 1 0 0 0 1

(I intentionally avoided `%==%`, since I've seen it used in other packages in a way that allows NA %==% NA to be true. If you prefer `%==%`, feel free, or use some other infix operator of your choosing. It doesn't even need to be infix; that's mostly aesthetic.)
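
For illustration, a quick check of `%=%` on mixed types and missing values (arbitrary vectors):

c(1, NA, 3) %=% c("1", NA, "3")
# [1]  TRUE FALSE  TRUE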

The question is how to automate this when we have more variables within each group (as defined by the trailing number in the variable name). For that, I propose we literally create the expression by hand and then eval/parse it.

g1 = grep("1", names(a1), value = TRUE)
g2 = grep("2", names(a1), value = TRUE)
expr <- paste0(
  "+(",
  paste(outer(g1, g2, function(a, b) sprintf("%s %%=%% %s", a, b)), collapse = " | "),
  ")")
expr
# [1] "+(z1 %=% z2 | x1 %=% z2 | z1 %=% x2 | x1 %=% x2)"

This produces the desired results:

a1[, result2 := eval(parse(text = expr))]
#       z1     x1    z2    x2 result result2
#    <num> <char> <num> <num>  <num>   <int>
# 1:     1      3     2     3      1       1
# 2:    NA      4    NA     5      0       0
# 3:     3      5     4     4      0       0
# 4:     4      6     5     7      0       0
# 5:     5      7     6     5      1       1

This scales well vertically: a1 has 5 rows, so replicating it 1e4 times produces 50K rows, and so on.

a1e4 <- rbindlist(replicate(1e4, a1, simplify=FALSE))    # 50K rows
system.time(a1e4[, result2 := eval(parse(text = expr))])
#    user  system elapsed 
#    0.06    0.00    0.06 

a1e5 <- rbindlist(replicate(1e5, a1, simplify=FALSE))    # 500K
system.time(a1e5[, result2 := eval(parse(text = expr))])
#    user  system elapsed 
#     0.7     0.0     0.7 

a1e6 <- rbindlist(replicate(1e6, a1, simplify=FALSE))    # 5M
system.time(a1e6[, result2 := eval(parse(text = expr))])
#    user  system elapsed 
#    7.16    0.06    7.22 

It appears to scale linearly, so the remaining 4x jump to the actual data's ~20 million rows should resolve in around 30 seconds (7.2 s × 4 ≈ 29 s).

What about with more variables per group? (i.e., scaling horizontally)

set.seed(42)
b1 <- copy(a1[,1:4])[, c("s1","t1","u1","v1","w1","y1", "s2","t2","u2","v2","w2","y2") :=
                         replicate(12, sample(9, .N, replace = TRUE), simplify = FALSE)]
b1
#       z1     x1    z2    x2    s1    t1    u1    v1    w1    y1    s2    t2    u2    v2    w2    y2
#    <num> <char> <num> <num> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1:     1      3     2     3     1     2     9     9     4     8     6     8     1     2     2     1
# 2:    NA      4    NA     5     5     1     5     9     2     6     2     2     5     4     7     1
# 3:     3      5     4     4     1     8     4     4     8     8     5     3     2     3     6     7
# 4:     4      6     5     7     9     7     2     5     3     4     4     8     6     6     8     4
# 5:     5      7     6     5     4     4     3     5     1     4     2     7     6     5     5     9
bg1 = grep("1", names(b1), value = TRUE)
bg2 = grep("2", names(b1), value = TRUE)
bexpr <- paste0(
  "+(",
  paste(outer(bg1, bg2, function(a, b) sprintf("%s %%=%% %s", a, b)), collapse = " | "),
  ")")
bexpr
# [1] "+(z1 %=% z2 | x1 %=% z2 | s1 %=% z2 | t1 %=% z2 | u1 %=% z2 | v1 %=% z2 | w1 %=% z2 | y1 %=% z2 | z1 %=% x2 | x1 %=% x2 | s1 %=% x2 | t1 %=% x2 | u1 %=% x2 | v1 %=% x2 | w1 %=% x2 | y1 %=% x2 | z1 %=% s2 | x1 %=% s2 | s1 %=% s2 | t1 %=% s2 | u1 %=% s2 | v1 %=% s2 | w1 %=% s2 | y1 %=% s2 | z1 %=% t2 | x1 %=% t2 | s1 %=% t2 | t1 %=% t2 | u1 %=% t2 | v1 %=% t2 | w1 %=% t2 | y1 %=% t2 | z1 %=% u2 | x1 %=% u2 | s1 %=% u2 | t1 %=% u2 | u1 %=% u2 | v1 %=% u2 | w1 %=% u2 | y1 %=% u2 | z1 %=% v2 | x1 %=% v2 | s1 %=% v2 | t1 %=% v2 | u1 %=% v2 | v1 %=% v2 | w1 %=% v2 | y1 %=% v2 | z1 %=% w2 | x1 %=% w2 | s1 %=% w2 | t1 %=% w2 | u1 %=% w2 | v1 %=% w2 | w1 %=% w2 | y1 %=% w2 | z1 %=% y2 | x1 %=% y2 | s1 %=% y2 | t1 %=% y2 | u1 %=% y2 | v1 %=% y2 | w1 %=% y2 | y1 %=% y2)"

Ugh, that looks bad, but the performance scales very nicely with 8 variables per group:

b1e4 <- rbindlist(replicate(1e4, b1, simplify=FALSE))
system.time(b1e4[, result2 := eval(parse(text = bexpr))])
#    user  system elapsed 
#    0.11    0.00    0.10 

b1e5 <- rbindlist(replicate(1e5, b1, simplify=FALSE))
system.time(b1e5[, result2 := eval(parse(text = bexpr))])
#    user  system elapsed 
#    1.03    0.00    1.03 

b1e6 <- rbindlist(replicate(1e6, b1, simplify=FALSE))
system.time(b1e6[, result2 := eval(parse(text = bexpr))])
#    user  system elapsed 
#   11.72    0.51   12.25 

Upvotes: 1

Frank

Reputation: 66819

The actual data has close to 20 million rows, and there are many variables in each group, so I'm looking for the most efficient method

You can transform to long form and join to see whether there are any matches. I'd guess it's relatively fast.

# this code should work for the original question (without character vectors or NAs)
# create a row id
a1[, row_id := .I]

# specify column groups
cols1 = c("x1", "z1")
cols2 = c("x2", "z2")

# transform to long form, drop colnames, drop dupes
longDT1 = unique(melt(a1[, c("row_id", ..cols1)], id.vars="row_id")[, !"variable"])
longDT2 = unique(melt(a1[, c("row_id", ..cols2)], id.vars="row_id")[, !"variable"])

# find any matches
w = longDT1[longDT2, on=.(row_id, value), which=TRUE, nomatch=0]

# find associated row_ids
match_row_ids = longDT1[w, unique(row_id)]

# flag rows
a1[, res := FALSE][match_row_ids, res := TRUE]

Note: If you have a mix of character values in some columns:

  • You can use type.convert as part of data cleaning to get the correct type from the start.
  • If you must have strings, then both value columns in longDT1 and longDT2 must be converted to strings (see the sketch below).
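
A minimal sketch of that string variant, assuming the same row_id, cols1, and cols2 as above (the !is.na() filters are an addition here, so that missing values never participate in the comparison, per the question):

# melt already coerces mixed character/numeric columns to character,
# so "3" and 3 end up as the same string; drop NAs before joining
longDT1 = unique(melt(a1[, c("row_id", ..cols1)], id.vars = "row_id")[, !"variable"])[
  , value := as.character(value)][!is.na(value)]
longDT2 = unique(melt(a1[, c("row_id", ..cols2)], id.vars = "row_id")[, !"variable"])[
  , value := as.character(value)][!is.na(value)]

The join and flagging steps above then work unchanged.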

Upvotes: 2

Gregor Thomas

Reputation: 146164

Here's another generalizable method relying on the column names for each group. Since apply coerces each row to a character vector, the question's character/numeric comparison works automatically; wrapping each side in na.omit keeps missing values from counting as matches:

g1 = grep("1", names(a1), value = TRUE)
g2 = grep("2", names(a1), value = TRUE)

a1[, result := as.integer(
  apply(.SD, MARGIN = 1, FUN = function(x) any(na.omit(x[g1]) %in% na.omit(x[g2])))
), .SDcols = c(g1, g2)]
a1
#    z1 x1 z2 x2 result
# 1:  1  3  2  3      1
# 2: NA  4 NA  5      0
# 3:  3  5  4  4      0
# 4:  4  6  5  7      0
# 5:  5  7  6  5      1

Upvotes: 3

akrun

Reputation: 887901

We may loop over the rows, find the length of the intersection between the two groups of values, and convert it to a 0/1 flag. Wrapping each side in na.omit keeps a missing value in both groups from counting as a match:

library(data.table)
a1[, result := +(apply(.SD, 1, FUN = function(x)
     length(intersect(na.omit(x[1:2]), na.omit(x[3:4]))) > 0))]

-output

> a1
   z1 x1 z2 x2 result
1:  1  3  2  3      1
2: NA  4 NA  5      0
3:  3  5  4  4      0
4:  4  6  5  7      0
5:  5  7  6  5      1
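
The na.omit calls are needed because intersect treats a shared NA as a common element; for illustration:

intersect(c(NA, "4"), c(NA, "5"))
# [1] NA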

With respect to efficiency, dapply (from collapse) may be faster than apply:

library(collapse)
a1[, result := +(dapply(.SD, MARGIN = 1, FUN = function(x)
          length(intersect(na.omit(x[1:2]), na.omit(x[3:4]))) > 0))]

Or use a vectorized option with str_detect, adding word boundaries so that one value cannot match inside another (e.g. a 2 matching inside 12). Note that paste() turns NA into the literal string "NA", so this variant does not honour the question's missing-value rule:

library(stringr)
a1[, result := +(str_detect(paste(z1, x1), paste0("\\b(", z2, "|", x2, ")\\b")))]

Upvotes: 2
