Tom Smith

Reputation: 371

R - Vectorising nested for loops?

I have two large data frames (head(CON_FRAs) and head(indels) are shown below) whose values I wish to compare and modify. I come from a Python background and so wrote the nested for loop below. It achieves my goal of finding values of indels$V4 that lie between CON_FRAs$V4 and CON_FRAs$V5, then setting indels$CON to 1 and adding 1 to CON_FRAs$V6, but it is very slow (runtime of 1 hour).

Can anyone help me vectorise this code?

On a side note, I also realise that I should not need to state the same if condition twice in the nested loop. However, I find many things unintuitive when programming in R and could not make R accept two actions from a single if statement.

CON_FRAs

  V1        V2        V3        V4        V5 V6
1  1  57859401  58018691  57859401  58018691  0
2  1  97522550  97892513  97522550  97892513  0
3  1 214173802 224638502 214173802 224638502  0
4  1 184035608 184239812 184035608 184239812  0
5  2 140988941 141140259 390239564 390390882  0
6  2 169205756 170181166 418456379 419431789  0

indels

  V1        V2        V3         V4         V5  V6     V7  V8            V9 V10 mut_type CON BIG FUN MRA LET recomb
1  6  96651182  96651183   57859401   57859402  AA      T CLL 001-0002-03TD  NA  COMPLEX   0   0   0   0   0      0
2 10  38406960  38406961 1718780121 1718780122  AG      - CLL 003-0005-09TD  NA      DEL   0   0   0   0   0      0
3  2  87017743  87017744  336268366  336268367   C     CT CLL 003-0005-09TD  NA  COMPLEX   0   0   0   0   0      0
4 20   5538748   5538750 2724112091 2724112093 CCC      A CLL    012-02-1TD  NA  COMPLEX   0   0   0   0   0      0
5  9 139390648 139390649 1678550376 1678550377  AG      - CLL    012-02-1TD  NA      DEL   0   0   0   0   0      0
6 10  10498176  10498180 1690871337 1690871341   - GAAAAA CLL           125  NA      INS   0   0   0   0   0      0

My Nested Loop

for(j in 1:length(indels$V4)){
   for(i in 1:length(CON_FRAs$V4)){
     if(CON_FRAs$V4[i] < indels$V4[j] & indels$V4[j] < CON_FRAs$V5[i])
       indels$CON[j] = 1
     if(CON_FRAs$V4[i] < indels$V4[j] & indels$V4[j] < CON_FRAs$V5[i])
       CON_FRAs$V6[i] = CON_FRAs$V6[i] + 1}}

UPDATE: I have managed to improve performance with a half-and-half approach: placing a vectorized command inside a single loop, which avoids the multiplicative cost of the nested loop. Two loops were still required, though (see below). This has reduced the runtime to under 2 minutes, which is quick enough for now, but I would still be interested if anyone could provide a fully vectorised solution.

# NOTE: `inc<-` is not base R; it is assumed here to come from a package such as Hmisc
for(j in 1:length(indels$V4)){
  inc(CON_FRAs$V6[CON_FRAs$V4 < indels$V4[j] & indels$V4[j] < CON_FRAs$V5]) <- 1}

for(i in 1:length(CON_FRAs$V6)){
  indels$CON[CON_FRAs$V4[i] < indels$V4 & indels$V4 < CON_FRAs$V5[i]] <- 1}

Upvotes: 0

Views: 111

Answers (1)

IRTFM

Reputation: 263481

As for your side-note problem, wrapping both assignments in braces should solve it:

for(j in 1:length(indels$V4)){
   for(i in 1:length(CON_FRAs$V4)){
     if(CON_FRAs$V4[i] < indels$V4[j] & indels$V4[j] < CON_FRAs$V5[i]) {
       # braces group both assignments under the single condition
       indels$CON[j] = 1
       CON_FRAs$V6[i] = CON_FRAs$V6[i] + 1
     }
   }
}

I think this is really two different problems when considered from a "vectorized" perspective, where a logical matrix is produced for the tests, perhaps using "outer". With one row per "j" (indels) and one column per "i" (CON_FRAs), a colSums on that matrix would give the accumulated "V6" counts for each "i", while for the "indels$CON" column I would use the result of "any" (using apply, or a row-maximum function from a non-base package) on the "j"-rows:

# X indexes rows of indels (j); Y indexes rows of CON_FRAs (i)
outer(1:nrow(indels), 
      1:nrow(CON_FRAs),
       function(X,Y) {CON_FRAs$V4[Y] < indels$V4[X] & 
                      indels$V4[X] < CON_FRAs$V5[Y]} )

      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE

Caveat. I was getting a different result in an earlier effort. Fixing the indexing in the logical expressions gives me the same (trivial) result as your code, but perhaps if you put in a better test case we could better compare results.
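
To make the colSums / "any" idea above concrete, here is a minimal sketch on toy data. The values are made up for illustration (they are not from your data frames) and were chosen so that some positions actually fall inside an interval. It assumes the same strict inequalities as your loop and that the full logical matrix fits in memory:

```r
# Toy data (hypothetical values, not taken from the question)
CON_FRAs <- data.frame(V4 = c(10, 50, 100), V5 = c(20, 80, 200), V6 = 0)
indels   <- data.frame(V4 = c(15, 60, 70, 300, 10), CON = 0)

# hits[j, i] is TRUE when indels$V4[j] lies strictly inside interval i
hits <- outer(indels$V4, CON_FRAs$V4, ">") &
        outer(indels$V4, CON_FRAs$V5, "<")

# Column sums count the indels falling in each interval (the "i" side)
CON_FRAs$V6 <- CON_FRAs$V6 + colSums(hits)

# A row with at least one TRUE marks an indel inside some interval (the "j" side)
indels$CON[rowSums(hits) > 0] <- 1
```

Note that the matrix has nrow(indels) * nrow(CON_FRAs) cells, so with very large data frames you may need to process indels in chunks rather than all at once.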

Upvotes: 1
