Reputation: 87
I have a dataset containing two groups of variables, and I want to check whether the two groups share any value. For each row, if any value in the first group equals any value in the second group, the row is marked 1, otherwise 0. Missing values should not take part in the comparison. If a group contains a character variable, it should still count as equal to a numeric variable whenever the underlying values are the same.
To illustrate, I generate the data a1. I want to determine whether any value in the first group of variables (z1 and x1) equals any value in the second group (z2 and x2), and store the outcome in the variable result.
library(data.table)
a1 = data.table(z1=c(1,NA,3:5), x1=c("3",4:7), z2=c(2,NA,4:6), x2=c(3,5,4,7,5))
a1$result = c(1,0,0,0,1)
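For clarity: in R, a character value compares equal to a numeric value whenever the numeric is coerced to the same string, which is exactly the behaviour I want here:
"3" == 3   # TRUE: 3 is coerced to "3" before the comparison
"03" == 3  # FALSE: the strings "03" and "3" differ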
The actual data is close to 20 million rows, and there are many variables in each group. I want to find the most efficient method. Thanks a lot!
Upvotes: 2
Views: 1046
Reputation: 160952
This is definitely a problem that scales poorly. After some benchmarking of (say) as.matrix, apply, asplit, data.table::transpose, etc., I haven't found one that scales reasonably beyond 50K rows.
The most direct (and, to me, palatable, performance-wise) path is the most literal:
a1[, result := +(z1 == z2 | z1 == x2 | x1 == z2 | x1 == x2)]
This fails with NA values, though, so we need to be a little more careful. After some playing, I think this helper function is the most direct, since it does exactly the logic we need and is fully vectorized:
`%=%` <- function(a, b) !is.na(a) & !is.na(b) & a == b
a1[, +(z1 %=% z2 | x1 %=% z2 | z1 %=% x2 | x1 %=% x2)]
# [1] 1 0 0 0 1
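A quick check of the NA handling with this helper, compared with plain ==:
c(1, NA, 3) == c(1, NA, 4)
# [1]  TRUE    NA FALSE
c(1, NA, 3) %=% c(1, NA, 4)
# [1]  TRUE FALSE FALSE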
(I intentionally avoided using `%==%` since I've seen that used in other packages in a way that allows NA %==% NA to be true. If you prefer `%==%`, feel free, or use some other infix operator of your choosing. It doesn't even need to be infix; that's mostly aesthetic.)
The question is how to automate this when we have more variables within each group (as defined by the trailing number in the variable name). For that, I propose we literally create the expression by hand and then eval/parse it.
g1 = grep("1", names(a1), value = TRUE)
g2 = grep("2", names(a1), value = TRUE)
expr <- paste0(
  "+(",
  paste(outer(g1, g2, function(a, b) sprintf("%s %%=%% %s", a, b)), collapse = " | "),
  ")")
expr
# [1] "+(z1 %=% z2 | x1 %=% z2 | z1 %=% x2 | x1 %=% x2)"
This produces the desired results:
a1[, result2 := eval(parse(text = expr))]
# z1 x1 z2 x2 result result2
# <num> <char> <num> <num> <num> <int>
# 1: 1 3 2 3 1 1
# 2: NA 4 NA 5 0 0
# 3: 3 5 4 4 0 0
# 4: 4 6 5 7 0 0
# 5: 5 7 6 5 1 1
This scales well vertically: if a1 is 5 rows, then replicating it 1e4 times produces 50K rows, etc.
a1e4 <- rbindlist(replicate(1e4, a1, simplify=FALSE)) # 50K rows
system.time(a1e4[, result2 := eval(parse(text = expr))])
# user system elapsed
# 0.06 0.00 0.06
a1e5 <- rbindlist(replicate(1e5, a1, simplify=FALSE)) # 500K
system.time(a1e5[, result2 := eval(parse(text = expr))])
# user system elapsed
# 0.7 0.0 0.7
a1e6 <- rbindlist(replicate(1e6, a1, simplify=FALSE)) # 5M
system.time(a1e6[, result2 := eval(parse(text = expr))])
# user system elapsed
# 7.16 0.06 7.22
It appears to scale linearly, which suggests that another 4x in rows (roughly the 20 million you mention) should finish in around 30 seconds.
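If you want to sanity-check that extrapolation closer to your actual size, the same pattern applies (a sketch only; timing and memory use depend on your machine, so I haven't run it here):
a4e6 <- rbindlist(replicate(4e6, a1, simplify=FALSE)) # ~20M rows
system.time(a4e6[, result2 := eval(parse(text = expr))])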
What about with more variables per group? (i.e., scaling horizontally)
set.seed(42)
b1 <- copy(a1[, 1:4])[, c("s1","t1","u1","v1","w1","y1", "s2","t2","u2","v2","w2","y2") :=
         replicate(12, sample(9, .N, replace = TRUE), simplify = FALSE)]
b1
# z1 x1 z2 x2 s1 t1 u1 v1 w1 y1 s2 t2 u2 v2 w2 y2
# <num> <char> <num> <num> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1: 1 3 2 3 1 2 9 9 4 8 6 8 1 2 2 1
# 2: NA 4 NA 5 5 1 5 9 2 6 2 2 5 4 7 1
# 3: 3 5 4 4 1 8 4 4 8 8 5 3 2 3 6 7
# 4: 4 6 5 7 9 7 2 5 3 4 4 8 6 6 8 4
# 5: 5 7 6 5 4 4 3 5 1 4 2 7 6 5 5 9
bg1 = grep("1", names(b1), value = TRUE)
bg2 = grep("2", names(b1), value = TRUE)
bexpr <- paste0(
  "+(",
  paste(outer(bg1, bg2, function(a, b) sprintf("%s %%=%% %s", a, b)), collapse = " | "),
  ")")
bexpr
# [1] "+(z1 %=% z2 | x1 %=% z2 | s1 %=% z2 | t1 %=% z2 | u1 %=% z2 | v1 %=% z2 | w1 %=% z2 | y1 %=% z2 | z1 %=% x2 | x1 %=% x2 | s1 %=% x2 | t1 %=% x2 | u1 %=% x2 | v1 %=% x2 | w1 %=% x2 | y1 %=% x2 | z1 %=% s2 | x1 %=% s2 | s1 %=% s2 | t1 %=% s2 | u1 %=% s2 | v1 %=% s2 | w1 %=% s2 | y1 %=% s2 | z1 %=% t2 | x1 %=% t2 | s1 %=% t2 | t1 %=% t2 | u1 %=% t2 | v1 %=% t2 | w1 %=% t2 | y1 %=% t2 | z1 %=% u2 | x1 %=% u2 | s1 %=% u2 | t1 %=% u2 | u1 %=% u2 | v1 %=% u2 | w1 %=% u2 | y1 %=% u2 | z1 %=% v2 | x1 %=% v2 | s1 %=% v2 | t1 %=% v2 | u1 %=% v2 | v1 %=% v2 | w1 %=% v2 | y1 %=% v2 | z1 %=% w2 | x1 %=% w2 | s1 %=% w2 | t1 %=% w2 | u1 %=% w2 | v1 %=% w2 | w1 %=% w2 | y1 %=% w2 | z1 %=% y2 | x1 %=% y2 | s1 %=% y2 | t1 %=% y2 | u1 %=% y2 | v1 %=% y2 | w1 %=% y2 | y1 %=% y2)"
Ugh, that looks bad, but the performance scales very nicely with 8 variables per group:
b1e4 <- rbindlist(replicate(1e4, b1, simplify=FALSE))
system.time(b1e4[, result2 := eval(parse(text = bexpr))])
# user system elapsed
# 0.11 0.00 0.10
b1e5 <- rbindlist(replicate(1e5, b1, simplify=FALSE))
system.time(b1e5[, result2 := eval(parse(text = bexpr))])
# user system elapsed
# 1.03 0.00 1.03
b1e6 <- rbindlist(replicate(1e6, b1, simplify=FALSE))
system.time(b1e6[, result2 := eval(parse(text = bexpr))])
# user system elapsed
# 11.72 0.51 12.25
Upvotes: 1
Reputation: 66819
The actual data is close to 20 million rows, and there are many variables in each group. I want to find the most efficient method
You can transform to long form and join to see whether there are any matches. I expect it to be relatively fast.
# this code should work for the original question (without character vectors or NAs)
# create a row id
a1[, row_id := .I]
# specify column groups
cols1 = c("x1", "z1")
cols2 = c("x2", "z2")
# transform to long form, drop colnames, drop dupes
longDT1 = unique(melt(a1[, c("row_id", ..cols1)], id.vars="row_id")[, !"variable"])
longDT2 = unique(melt(a1[, c("row_id", ..cols2)], id.vars="row_id")[, !"variable"])
# find any matches
w = longDT1[longDT2, on=.(row_id, value), which=TRUE, nomatch=0]
# find associated row_ids
match_row_ids = longDT1[w, unique(row_id)]
# flag rows
a1[, res := FALSE][match_row_ids, res := TRUE]
Note: if you have a mix of character and numeric values in some columns, either use type.convert as part of data cleaning to get the correct types from the start, or convert the value columns in longDT1 and longDT2 to strings before the join. A minimal sketch of both options follows.
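Both in one place, assuming for illustration that x1 is the only character column (adapt to your data):
# option 1: fix the type during cleaning, so every column is numeric
a1[, x1 := type.convert(x1, as.is = TRUE)]
# option 2: make both long tables join on character values instead
longDT1[, value := as.character(value)]
longDT2[, value := as.character(value)]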
Upvotes: 2
Reputation: 146164
Here's another generalizable method relying on the column names for each group:
g1 = grep("1", names(a1), value = TRUE)
g2 = grep("2", names(a1), value = TRUE)
a1[, result := as.integer(
apply(.SD, MARGIN = 1, FUN = function(x) any(x[g1] %in% x[g2]))
), .SDcols = c(g1, g2)]
a1
# z1 x1 z2 x2 result
# 1: 1 3 2 3 1
# 2: 2 4 3 5 0
# 3: 3 5 4 4 0
# 4: 4 6 5 7 0
# 5: 5 7 6 5 1
Upvotes: 3
Reputation: 887901
We may loop over the rows, find the length of the intersect between the pairs, and convert to logical:
library(data.table)
a1[, result := +(apply(.SD, 1, FUN = function(x)
length(intersect(x[1:2], x[3:4]))) > 0)]
-output
> a1
z1 x1 z2 x2 result
1: 1 3 2 3 1
2: 2 4 3 5 0
3: 3 5 4 4 0
4: 4 6 5 7 0
5: 5 7 6 5 1
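For a single row, this checks whether the intersection of the two pairs is non-empty; for the first row, for example:
x <- c(1, 3, 2, 3)                      # z1, x1, z2, x2 of row 1
length(intersect(x[1:2], x[3:4])) > 0   # TRUE, so result is 1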
With respect to efficiency, dapply (from collapse) may be faster than apply:
library(collapse)
a1[, result := +(dapply(.SD, MARGIN = 1, FUN = function(x)
     length(intersect(x[1:2], x[3:4]))) > 0)]
Or use a vectorized option with str_detect
library(stringr)
a1[, result := +(str_detect(paste(z1, x1), paste0(z2, "|", x2)))]
Upvotes: 2