Reputation: 751
I have tried writing the code below to identify numbers inside a range, i.e. SNP[i,1] should be greater than or equal to Working[j,1] and less than or equal to Working[j,2] for the row to be added to a new data frame.
The SNP file has 350 lines and Working has 6500. For some reason, I end up with tens of thousands of lines of data that do not fit my conditional.
Is it obvious what I have wrong here?
for (i in 1:nrow(SNP_file)) {
  for (j in 1:nrow(Working)) {
    if ((as.numeric(SNP_file[i, 1]) >= as.numeric(Working[j, 1])) &&
        (as.numeric(SNP_file[i, 1]) <= as.numeric(Working[j, 2]))) {
      New <- rbind(New, data.frame(Chromosome = Working[j, 1],
                                   Start = Working[j, 2],
                                   Stop = Working[j, 3],
                                   GO = Working[j, 4],
                                   Position = VCF[i, 1],
                                   REF = SNP_file[i, 2],
                                   GT = SNP_file[i, 3],
                                   Site_Conf = SNP_file[i, 4]))
    }
  }
}
Thanks,
J
Upvotes: 0
Views: 100
Reputation: 28441
You can avoid the for loop since the comparison is vectorized:
indx <- SNP_file[,1] >= Working[,1] & SNP_file[,1] <= Working[,2]
[1] TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
This outputs a logical vector of the rows satisfying the condition. That is very helpful because now you can use that vector as an index for subsetting.
newdf <- cbind(SNP_file[indx,2], Working[indx,2:3])
In this case I assigned the second column of SNP_file and the second and third columns of Working to a new data frame, keeping just the rows that met the condition.
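To make the indexing step concrete, here is a minimal sketch on toy data. The data frames `snp` and `rng` and their values are made up for illustration and are not the asker's files. Note that the comparison is element-wise: row k of one data frame is tested against row k of the other.

```r
# Toy data (hypothetical values, not the asker's files)
snp <- data.frame(pos = c(5, 12, 30), ref = c("A", "C", "G"))
rng <- data.frame(start = c(1, 20, 25), stop = c(10, 25, 35))

# Element-wise test: is snp$pos[k] inside [rng$start[k], rng$stop[k]]?
indx <- snp$pos >= rng$start & snp$pos <= rng$stop
indx
# [1]  TRUE FALSE  TRUE

# The logical vector selects the matching rows from both data frames
newdf <- cbind(snp[indx, ], rng[indx, ])
newdf
```

Rows 1 and 3 are kept because 5 lies in [1, 10] and 30 lies in [25, 35], while 12 is outside [20, 25].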
That is just an example for clarity. Your example wasn't reproducible but try this instead:
Working <- Working[1:nrow(SNP_file), ]
indx <- as.numeric(SNP_file[, 1]) >= as.numeric(Working[, 1]) &
  as.numeric(SNP_file[, 1]) <= as.numeric(Working[, 2])
New <- data.frame(Chromosome = Working[indx, 1],
                  Start = Working[indx, 2],
                  Stop = Working[indx, 3],
                  GO = Working[indx, 4],
                  Position = VCF[indx, 1],
                  REF = SNP_file[indx, 2],
                  GT = SNP_file[indx, 3],
                  Site_Conf = SNP_file[indx, 4])
Note that the two data frames do not have the same number of rows, so only the first 350 rows of Working were compared to SNP_file. If you want to compare them in a different way, you should specify that.
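The truncation matters because R recycles the shorter vector when two vectors of unequal length are compared. The sketch below uses made-up vectors `pos` and `start` (3 and 7 values) as stand-ins for the 350-row and 6500-row columns. It only illustrates the recycling behaviour and is not a recommendation for how the real data should be matched.

```r
pos   <- c(5, 12, 30)                 # short vector, like SNP_file[, 1]
start <- c(1, 20, 25, 0, 10, 40, 3)   # long vector, like Working[, 1]

# Without truncation, `pos` is recycled to length 7. R also warns here
# because 7 is not a multiple of 3 (the warning is suppressed for the demo).
res <- suppressWarnings(pos >= start)
length(res)
# [1] 7

# Truncating the longer vector gives a plain row-by-row comparison
start_trim <- start[seq_along(pos)]
pos >= start_trim
# [1]  TRUE FALSE  TRUE
```

With the real sizes, 6500 is not a multiple of 350 either, so the untruncated comparison would be expected to raise the same kind of warning.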
Data
set.seed(7)
SNP_file <- data.frame(x=sample(10), y=month.abb[1:10])
Working <- data.frame(x=sample(10), y=sample(20,10), z=(sample(LETTERS[1:10])))
Upvotes: 2
Reputation: 328
It is not clear from your original post what you want to achieve. How do the indices i and j correspond to each other? Can you give an example of the data (input and desired output)?
The loop in your code repeats 6500 * 350 = 2275000 times, and it compares every row in SNP_file with every row in Working.
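If comparing every row with every row really is the intent, one possible way to do it without the nested loop is `outer()` followed by `which(arr.ind = TRUE)`. This is only a sketch on toy data: the data frames `snp` and `rng`, their column names, and the assumption that the range is given by a start and a stop column are all illustrative and not taken from the original files.

```r
# Toy data (hypothetical values and column names)
snp <- data.frame(pos = c(5, 10, 30))
rng <- data.frame(start = c(1, 10, 25), stop = c(10, 15, 35))

# hits[i, j] is TRUE when snp$pos[i] lies inside range j
hits <- outer(snp$pos, rng$start, ">=") & outer(snp$pos, rng$stop, "<=")

# Row and column indices of every matching (i, j) pair
pairs <- which(hits, arr.ind = TRUE)

# Build the result once instead of growing it with rbind() in a loop
New <- data.frame(Start    = rng$start[pairs[, "col"]],
                  Stop     = rng$stop[pairs[, "col"]],
                  Position = snp$pos[pairs[, "row"]])
New
```

Here position 10 falls inside two ranges, so it appears twice and `New` has four rows. For 350 by 6500 rows the intermediate logical matrix has the same 2275000 entries as the loop has iterations, which should fit in memory comfortably.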
Upvotes: 0