Reputation: 99

A more elegant way to iterate over two data.tables' rows

I'm starting using R more and more frequently, coming from C/C++. For this reason, I often find myself thinking à la C++ when working with R's data structures.

Here I have two data.tables that I have to iterate through and update the value of column 1 and column 2 in table A with the value of column 2 in table B, according to column 1 table B w.r.t. columns 1 and 2 in table A.

Sorry for this confusing description. I try to make it better

I have two data tables (the number of rows is different because they could actually be different):

TabA

     Col1     Col2
1:   TP53     CD68
2:   TP53    MPDU1
3:   TP53     PHF2
4:   TP53 KIAA0753
5:   CD68    ZBTB4
6:   CD68     CHD3
7:  MPDU1    ZBTB4
8:  MPDU1     CHD3
9:  MPDU1   SLC2A4
10: MPDU1     YBX2
11: MPDU1    AURKB
12: MPDU1 TMEM132B
13:  PHF2 C9orf129
14:  PHF2    CDH23
15:  PHF2   PTPDC1

and TabB:

       Col3 Col4
1:   ADAM32  0
2:   ADARB2  1
3:    AGBL2  2
4:   ALOX12  3
5:  ANKRD46  4
6:    APOL1  5
7:    APOOL  6
8:     ASPA  7
9:      AUH  8
10:   AURKB  9
11:   AUTS2 10
12:    BAAT 11

So basically, I want to compare Col1 and Col2 from TabA with Col3 in TabB: if they are equal substitute the string with the number in Col4 of TabB.

My approach, definitely C-style:

for(i in 1:nrow(TabA)) {
    for(j in 1:nrow(TabB)) {
        if(TabA$Col1[i] == TabB$Col3[j]) { 
            TabA$Col1[i] <- TabB$Col4[j] 
        }
        if(TabA$Col2[i] == TabB$Col3[j]) { 
            TabA$Col2[i] <- TabB$Col4[j] 
        }
    }
}

This works as expected, but I am pretty sure there is a more elegant (and more efficient) way to do that, exploiting data.table's capabilities. Does anybody have a suggestion?

Thanks

Upvotes: 0

Answers (4)

Nick Kennedy

Reputation: 12640

If you are using data.table your original problem (prior to the extensive edits which changed the question) could be done like this:

TabA <- data.table(Col1 = sample(LETTERS, 15), Col3 = rnorm(15))
TabB <- data.table(Col2 = sample(LETTERS, 15), Col4 = rnorm(15))
setkey(TabA, Col1)
setkey(TabB, Col2)
TabA
# shows TabA before changing it
TabA[TabB, Col3 := Col4]

I still find data.table's syntax a little odd because it differs from the standard behaviour in R when functions don't modify their arguments. On the other hand, it makes for concise, memory-efficient and fast code.

Solution to rewritten question

Note your sample data is unhelpful - there are no matches of TabA$Col1 in TabB$Col3. Anyway, this could be tackled in a few ways.

Using data.table := operator:

TabA[Col1 %in% TabB$Col3, Col1 := with(TabB, as.character(Col4[na.omit(match(Col1, Col3))]))]
TabA[Col2 %in% TabB$Col3, Col2 := with(TabB, as.character(Col4[na.omit(match(Col2, Col3))]))]

Using base R syntax (would work for a data.frame also):

TabA$Col1[TabA$Col1 %in% TabB$Col3] <- TabB$Col4[match(TabA$Col1[TabA$Col1 %in% TabB$Col3], TabB$Col3)]
TabA$Col2[TabA$Col2 %in% TabB$Col3] <- TabB$Col4[match(TabA$Col2[TabA$Col2 %in% TabB$Col3], TabB$Col3)]

Using setkey and joins:

TabA[, Index := 1:nrow(TabA)]
setkey(TabA, Col1)
TabA[TabB, nomatch = 0, Col1 := as.character(Col4)]
setkey(TabA, Col2)
TabA[TabB, nomatch = 0, Col2 := as.character(Col4)]
setkey(TabA, Index)
TabA[, Index := NULL]

All of these assume that some items in Col1 and Col2 won't be matched. The code could be optimised if this isn't true. This is also the reason that Col4 has to be coerced to character.

Upvotes: 5

Minh Ha Pham

Reputation: 2596

I usually use by=1:nrow(DT) when I want to iterate throw each row of datatabe

library(data.table)

TabA <- data.table(Col1=c('A', 'B', 'C', 'D'), Col2=c('B', 'A', 'D', 'C'))
TabB <- data.table(Col3=c('A', 'B', 'C'), Col4=c(1, 2, 3))

TabA[, .(Col1=as.character(ifelse(nrow(TabB[Col3==Col1]) > 0, TabB[Col3==Col1]$Col4, Col1)), 
         Col2=as.character(ifelse(nrow(TabB[Col3==Col2]) > 0, TabB[Col3==Col2]$Col4, Col2))
         ), 
     by=1:nrow(TabA)][, .(Col1, Col2)]

TabA[, `:=`(Col1=as.character(ifelse(nrow(TabB[Col3==Col1]) > 0, TabB[Col3==Col1]$Col4, Col1)), 
            Col2=as.character(ifelse(nrow(TabB[Col3==Col2]) > 0, TabB[Col3==Col2]$Col4, Col2))
            ), 
     by=1:nrow(TabA)]

Upvotes: 1

RHertel

Reputation: 23818

Here's one suggestion that uses only base R.

the_equals <- TabA$Col1[which(TabA$Col1 %in% TabB$Col3)]
nequals <- length(the_equals)
if(nequals>0) {
 idx1 <- lapply(1:nequals, function(x) which(TabA$Col1==the_equals[x]))
 idx2 <- lapply(1:nequals, function(x) which(TabB$Col3==the_equals[x]))
 Col2_new <- sapply(1:nequals, function(x) TabB$Col4[idx2[[x]][1]])
 for(i in 1:nequals) TabA$Col2[idx1[[i]]] <- Col2_new[i]
}

Hope this helps.

Upvotes: 0

Andrelrms

Reputation: 819

Since I don't have your data I made a little example but I think this what you are looking for:

x<- c("a","b","c","d") 
y<-c("a","d","e","f")
z<-c("z1","z2","z3","z4")
x[x %in% y]<-z[x %in% y]

Upvotes: 0

A more elegant way to iterate over two data.tables&#39; rows

Answers (4)

Solution to rewritten question

Related Questions

A more elegant way to iterate over two data.tables' rows