Reputation: 371
I have two large data frames (head(CON_FRAs) and head(indels) are shown below) whose values I want to compare and modify. I come from a Python background, so I wrote the nested for loop below. It achieves my goal: for each value of indels$V4 that lies between CON_FRAs$V4 and CON_FRAs$V5, it sets indels$CON to 1 and adds 1 to CON_FRAs$V6. It is just very slow (runtime of 1 hour).
Can anyone help me vectorise this code?
On a side note, I realise that I should not need to state the same if condition twice in the nested loop. However, I find many things unintuitive when programming in R and could not make R accept two actions from a single if statement.
CON_FRAs
  V1        V2        V3        V4        V5 V6
1  1  57859401  58018691  57859401  58018691  0
2  1  97522550  97892513  97522550  97892513  0
3  1 214173802 224638502 214173802 224638502  0
4  1 184035608 184239812 184035608 184239812  0
5  2 140988941 141140259 390239564 390390882  0
6  2 169205756 170181166 418456379 419431789  0
indels
  V1        V2        V3         V4         V5  V6     V7  V8            V9 V10 mut_type CON BIG FUN MRA LET recomb
1  6  96651182  96651183   57859401   57859402  AA      T CLL 001-0002-03TD  NA  COMPLEX   0   0   0   0   0      0
2 10  38406960  38406961 1718780121 1718780122  AG      - CLL 003-0005-09TD  NA      DEL   0   0   0   0   0      0
3  2  87017743  87017744  336268366  336268367   C     CT CLL 003-0005-09TD  NA  COMPLEX   0   0   0   0   0      0
4 20   5538748   5538750 2724112091 2724112093 CCC      A CLL    012-02-1TD  NA  COMPLEX   0   0   0   0   0      0
5  9 139390648 139390649 1678550376 1678550377  AG      - CLL    012-02-1TD  NA      DEL   0   0   0   0   0      0
6 10  10498176  10498180 1690871337 1690871341   - GAAAAA CLL           125  NA      INS   0   0   0   0   0      0
My Nested Loop
for(j in 1:length(indels$V4)){
  for(i in 1:length(CON_FRAs$V4)){
    if(CON_FRAs$V4[i] < indels$V4[j] & indels$V4[j] < CON_FRAs$V5[i])
      indels$CON[j] = 1
    if(CON_FRAs$V4[i] < indels$V4[j] & indels$V4[j] < CON_FRAs$V5[i])
      CON_FRAs$V6[i] = CON_FRAs$V6[i] + 1
  }
}
UPDATE: I have managed to improve the performance with a half-and-half approach: placing a vectorized command inside a single loop, which avoids the multiplicative cost of the nested loop. Two separate loops were still required, though (see below). This has reduced the runtime to under 2 minutes. That will have to do for now because it is quick enough, but I would still be interested if anyone could provide a fully vectorised solution.
# Note: inc<- is not base R; it has to come from an add-on package (e.g. Hmisc).
for(j in 1:length(indels$V4)){
  inc(CON_FRAs$V6[CON_FRAs$V4 < indels$V4[j] & indels$V4[j] < CON_FRAs$V5]) <- 1
}
for(i in 1:length(CON_FRAs$V6)){
  indels$CON[CON_FRAs$V4[i] < indels$V4 & indels$V4 < CON_FRAs$V5[i]] <- 1
}
Upvotes: 0
Views: 111
Reputation: 263481
As for your side-note problem, wrapping both statements in braces after a single if should solve it:
for(j in 1:length(indels$V4)){
  for(i in 1:length(CON_FRAs$V4)){
    if(CON_FRAs$V4[i] < indels$V4[j] & indels$V4[j] < CON_FRAs$V5[i]) {
      indels$CON[j] = 1
      CON_FRAs$V6[i] = CON_FRAs$V6[i] + 1
    }
  }
}
I think this is really two different problems when considered from a "vectorized" perspective, in which a matrix of the logical tests is produced, perhaps using "outer". The accumulation of "V6" values lends itself to a colSums over the "i" columns, whereas for the "indels$CON" column I would expect it to be the result of "any" on the "j" rows (using apply, or colMax from a non-base package):
# X runs over the rows of indels (j), Y over the rows of CON_FRAs (i)
outer(1:nrow(indels),
      1:nrow(CON_FRAs),
      function(X,Y) {CON_FRAs$V4[Y] < indels$V4[X] &
                     indels$V4[X] < CON_FRAs$V5[Y]} )
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE
Caveat. I was getting a different result in an earlier effort. Fixing the indexing in the logical expressions gives me the same (trivial) result as your code, but perhaps if you put in a better test case we could better compare results.
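To make the matrix idea concrete, here is a minimal sketch of how the two updates might be derived from one logical matrix. The toy data frames are hypothetical (they are not the question's data) and keep only the columns the test uses; they were chosen so that some comparisons come out TRUE. It also assumes the strict inequalities from the original loop are what is wanted.

```r
# Hypothetical toy data, reduced to the columns involved in the comparison.
CON_FRAs <- data.frame(V4 = c(10, 50, 100), V5 = c(20, 80, 200), V6 = 0)
indels   <- data.frame(V4 = c(15, 60, 70, 150, 300, 10), CON = 0)

# hits[j, i] is TRUE when indels$V4[j] lies strictly between
# CON_FRAs$V4[i] and CON_FRAs$V5[i] (rows = indels, columns = intervals).
hits <- outer(indels$V4, CON_FRAs$V4, ">") &
        outer(indels$V4, CON_FRAs$V5, "<")

# An indel is flagged when it falls inside at least one interval.
indels$CON[rowSums(hits) > 0] <- 1

# Each interval is incremented once per indel that falls inside it.
CON_FRAs$V6 <- CON_FRAs$V6 + colSums(hits)
```

With this toy input, indels$CON becomes 1 1 1 1 0 0 and CON_FRAs$V6 becomes 1 2 1, which is what the nested loop would produce on the same data. One thing to keep in mind is that outer builds a full nrow(indels) by nrow(CON_FRAs) matrix, so memory use grows with the product of the two row counts; for very large inputs it may be necessary to process indels in chunks.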
Upvotes: 1