Reputation: 99
I'm starting using R more and more frequently, coming from C/C++. For this reason, I often find myself thinking à la C++ when working with R's data structures.
Here I have two data.tables that I have to iterate through and update the value of column 1 and column 2 in table A with the value of column 2 in table B, according to column 1 table B w.r.t. columns 1 and 2 in table A.
Sorry for this confusing description. I try to make it better
I have two data tables (the number of rows is different because they could actually be different):
TabA
Col1 Col2
1: TP53 CD68
2: TP53 MPDU1
3: TP53 PHF2
4: TP53 KIAA0753
5: CD68 ZBTB4
6: CD68 CHD3
7: MPDU1 ZBTB4
8: MPDU1 CHD3
9: MPDU1 SLC2A4
10: MPDU1 YBX2
11: MPDU1 AURKB
12: MPDU1 TMEM132B
13: PHF2 C9orf129
14: PHF2 CDH23
15: PHF2 PTPDC1
and TabB:
Col3 Col4
1: ADAM32 0
2: ADARB2 1
3: AGBL2 2
4: ALOX12 3
5: ANKRD46 4
6: APOL1 5
7: APOOL 6
8: ASPA 7
9: AUH 8
10: AURKB 9
11: AUTS2 10
12: BAAT 11
So basically, I want to compare Col1 and Col2 from TabA with Col3 in TabB: if they are equal substitute the string with the number in Col4 of TabB.
My approach, definitely C-style:
for(i in 1:nrow(TabA)) {
for(j in 1:nrow(TabB)) {
if(TabA$Col1[i] == TabB$Col3[j]) {
TabA$Col1[i] <- TabB$Col4[j]
}
if(TabA$Col2[i] == TabB$Col3[j]) {
TabA$Col2[i] <- TabB$Col4[j]
}
}
}
This works as expected, but I am pretty sure there is a more elegant (and more efficient) way to do that, exploiting data.table's capabilities. Does anybody have a suggestion?
Thanks
Upvotes: 0
Views: 529
Reputation: 12640
If you are using data.table
your original problem (prior to the extensive edits which changed the question) could be done like this:
TabA <- data.table(Col1 = sample(LETTERS, 15), Col3 = rnorm(15))
TabB <- data.table(Col2 = sample(LETTERS, 15), Col4 = rnorm(15))
setkey(TabA, Col1)
setkey(TabB, Col2)
TabA
# shows TabA before changing it
TabA[TabB, Col3 := Col4]
I still find data.table
's syntax a little odd because it differs from the standard behaviour in R when functions don't modify their arguments. On the other hand, it makes for concise, memory-efficient and fast code.
Note your sample data is unhelpful - there are no matches of TabA$Col1
in TabB$Col3
. Anyway, this could be tackled in a few ways.
Using data.table
:=
operator:
TabA[Col1 %in% TabB$Col3, Col1 := with(TabB, as.character(Col4[na.omit(match(Col1, Col3))]))]
TabA[Col2 %in% TabB$Col3, Col2 := with(TabB, as.character(Col4[na.omit(match(Col2, Col3))]))]
Using base R syntax (would work for a data.frame
also):
TabA$Col1[TabA$Col1 %in% TabB$Col3] <- TabB$Col4[match(TabA$Col1[TabA$Col1 %in% TabB$Col3], TabB$Col3)]
TabA$Col2[TabA$Col2 %in% TabB$Col3] <- TabB$Col4[match(TabA$Col2[TabA$Col2 %in% TabB$Col3], TabB$Col3)]
Using setkey and joins:
TabA[, Index := 1:nrow(TabA)]
setkey(TabA, Col1)
TabA[TabB, nomatch = 0, Col1 := as.character(Col4)]
setkey(TabA, Col2)
TabA[TabB, nomatch = 0, Col2 := as.character(Col4)]
setkey(TabA, Index)
TabA[, Index := NULL]
All of these assume that some items in Col1 and Col2 won't be matched. The code could be optimised if this isn't true. This is also the reason that Col4 has to be coerced to character.
Upvotes: 5
Reputation: 2596
I usually use by=1:nrow(DT)
when I want to iterate throw each row of datatabe
library(data.table)
TabA <- data.table(Col1=c('A', 'B', 'C', 'D'), Col2=c('B', 'A', 'D', 'C'))
TabB <- data.table(Col3=c('A', 'B', 'C'), Col4=c(1, 2, 3))
TabA[, .(Col1=as.character(ifelse(nrow(TabB[Col3==Col1]) > 0, TabB[Col3==Col1]$Col4, Col1)),
Col2=as.character(ifelse(nrow(TabB[Col3==Col2]) > 0, TabB[Col3==Col2]$Col4, Col2))
),
by=1:nrow(TabA)][, .(Col1, Col2)]
or
TabA[, `:=`(Col1=as.character(ifelse(nrow(TabB[Col3==Col1]) > 0, TabB[Col3==Col1]$Col4, Col1)),
Col2=as.character(ifelse(nrow(TabB[Col3==Col2]) > 0, TabB[Col3==Col2]$Col4, Col2))
),
by=1:nrow(TabA)]
Upvotes: 1
Reputation: 23818
Here's one suggestion that uses only base R
.
the_equals <- TabA$Col1[which(TabA$Col1 %in% TabB$Col3)]
nequals <- length(the_equals)
if(nequals>0) {
idx1 <- lapply(1:nequals, function(x) which(TabA$Col1==the_equals[x]))
idx2 <- lapply(1:nequals, function(x) which(TabB$Col3==the_equals[x]))
Col2_new <- sapply(1:nequals, function(x) TabB$Col4[idx2[[x]][1]])
for(i in 1:nequals) TabA$Col2[idx1[[i]]] <- Col2_new[i]
}
Hope this helps.
Upvotes: 0
Reputation: 819
Since I don't have your data I made a little example but I think this what you are looking for:
x<- c("a","b","c","d")
y<-c("a","d","e","f")
z<-c("z1","z2","z3","z4")
x[x %in% y]<-z[x %in% y]
Upvotes: 0