Reputation: 4757
I am trying to "merge" column V of a Data Frame in another one if the columns X and Y are equals (I have to match dOne.X == dTwo.X & dOne.Y == dTwo.Y
and also dOne.X == dTwo.Y & dOne.Y == dTwo.X
)
I solved this using a for
loop, but it is slow when the Data Frame dOne is big (in my machine it takes 25 minutes if length(dOne.X) == 500000
). I would like to know if there is a way to solve this problem using a faster "vectorized" operation. Above is an exemple of what I want to do:
Data Frame ONE
X Y V
a b 2
a c 3
a d 0
a e 0
b c 2
b d 3
b e 0
c d 2
c e 0
d e 0
Data Frame TWO
X Y V
a b 1
a c 1
a d 1
b c 1
b d 1
c d 1
e d 1
Expected Data Frame after the columns are merged
X Y V V2
a b 2 1
a c 3 1
a d 0 1
a e 0 0
b c 2 1
b d 3 1
b e 0 0
c d 2 1
c e 0 0
d e 0 1
This is the code I am using so far that is slow when dOne is big (hundreds of thousands or rows):
copyadjlistValueColumn <- function(dOne, dTwo) {
dOne$V2 <- 0
lv <- union(levels(dOne$Y), levels(dOne$X))
dTwo$X <- factor(dTwo$X, levels = lv)
dTwo$Y <- factor(dTwo$Y, levels = lv)
dOne$X <- factor(dOne$X, levels = lv)
dOne$Y <- factor(dOne$Y, levels = lv)
for(i in 1:nrow(dTwo)) {
row <- dTwo[i,]
dOne$V2[dOne$X == row$X & dOne$Y == row$Y] <- row$V
dOne$V2[dOne$X == row$Y & dOne$Y == row$X] <- row$V
}
dOne
}
This is a testthat test case that covers what I am expecting (using the data frames above):
test_that("Copy V column to another Data Frame", {
dfOne <- data.frame(X=c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
Y=c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
V=c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0))
dfTwo <- data.frame(X=c("a", "a", "a", "b", "b", "c", "e"),
Y=c("b", "c", "d", "c", "d", "d", "d"),
V=c(1, 1, 1, 1, 1, 1, 1))
lv <- union(levels(dfTwo$Y), levels(dfTwo$X))
dfExpected <- data.frame(X=c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
Y=c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
V=c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0),
V2=c(1, 1, 1, 0, 1, 1, 0, 1, 0, 1))
dfExpected$X <- factor(dfExpected$X, levels = lv)
dfExpected$Y <- factor(dfExpected$Y, levels = lv)
dfMerged <- copyadjlistValueColumn(dfOne, dfTwo)
expect_identical(dfMerged, dfExpected)
})
Any suggestion?
Thanks a lot :)
Upvotes: 4
Views: 7744
Reputation: 67778
Try two merge
, where order of matching columns is reversed in the second, to get the 'bidirectional' matching. Then you may use e.g. rowSums
to collapse the two created columns to one.
d1 <- merge(dfOne, dfTwo, by.x = c("X", "Y"), by.y = c("X", "Y"), all.x = TRUE)
d2 <- merge(d1, dfTwo, by.x = c("X", "Y"), by.y = c("Y", "X"), all.x = TRUE)
cbind(dfOne, V2 = rowSums(cbind(d2$V.y, d2$V), na.rm = TRUE))
# X Y V V2
# 1 a b 2 1
# 2 a c 3 1
# 3 a d 0 1
# 4 a e 0 0
# 5 b c 2 1
# 6 b d 3 1
# 7 b e 0 0
# 8 c d 2 1
# 9 c e 0 0
# 10 d e 0 1
For faster alternatives to merge
, check data.table
and dplyr
alternatives here: stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right/
Upvotes: 3
Reputation: 553
With dplyr
:
library(dplyr)
left_join(dfOne, dfTwo, by = c("X", "Y")) %>%
left_join(dfTwo, by = c("X" = "Y", "Y" = "X")) %>%
mutate(V2 = ifelse(is.na(V.y), V, V.y)) %>%
select(X, Y, V = V.x, V2) %>%
do(replace(., is.na(.), 0))
Upvotes: 2
Reputation: 92282
Here's a possible data.table
package approach. This approach should be particularly efficient for a big data set like you have:
First convert to data.table
object and add keys
library(data.table)
setkey(setDT(dfOne), X, Y)
setkey(setDT(dfTwo), X, Y)
Then perform a join on X & Y
combination - the join is performed by matching key columns X,Y
of dfOne
with key columns X,Y
of dfTwo
respectively.
dfOne[dfTwo, V2 := i.V]
Now perform a join on Y & X
combination - the join is performed by matching key columns X,Y
of dfOne
with key columns Y,X
of dfTwo
respectively.
setkey(dfTwo, Y, X)
dfOne[dfTwo, V2 := i.V][]
Result (I'll keep the unmatched as NA
s instead of zeroes as it makes more sense this way):
# X Y V V2
# 1: a b 2 1
# 2: a c 3 1
# 3: a d 0 1
# 4: a e 0 NA
# 5: b c 2 1
# 6: b d 3 1
# 7: b e 0 NA
# 8: c d 2 1
# 9: c e 0 NA
# 10: d e 0 1
Upvotes: 2