JP1

Reputation: 751

Error with parsing data in R

I have tried writing the code below to identify numbers inside a range, i.e. SNP[i,1] should be less than Working[j,1] and greater than Working[j,2] to be added to a new data frame.

The SNP file has 350 lines and Working has 6500. For some reason, I end up with tens of thousands of lines of data that do not fit my conditional.

Is it obvious what I have wrong here?

for (i in 1:nrow(SNP_file)) {
  for (j in 1:nrow(Working)) {
    if ((as.numeric(SNP_file[i, 1]) >= as.numeric(Working[j, 1])) &&
        (as.numeric(SNP_file[i, 1]) <= as.numeric(Working[j, 2]))) {
      New <- rbind(New, data.frame(Chromosome = Working[j, 1],
                                   Start      = Working[j, 2],
                                   Stop       = Working[j, 3],
                                   GO         = Working[j, 4],
                                   Position   = VCF[i, 1],
                                   REF        = SNP_file[i, 2],
                                   GT         = SNP_file[i, 3],
                                   Site_Conf  = SNP_file[i, 4]))
    }
  }
}

Thanks,

J

Upvotes: 0

Views: 100

Answers (2)

Pierre L

Reputation: 28441

You can avoid the for loop since the comparison is vectorized:

indx <- SNP_file[,1] >= Working[,1] & SNP_file[,1] <= Working[,2]
indx
[1]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE

This outputs a logical vector of the rows satisfying the condition. That is very helpful because now you can use that vector as an index for subsetting.

newdf <- cbind(SNP_file[indx,2], Working[indx,2:3])

In this case I assigned the second column of SNP_file and the second and third columns of Working to a new data frame, keeping only the rows that met the condition.


That is just an example for clarity. Your example wasn't reproducible, but try this instead:

Working <- Working[1:nrow(SNP_file),]
indx <- as.numeric(SNP_file[,1]) >= as.numeric(Working[,1]) & as.numeric(SNP_file[,1]) <= as.numeric(Working[,2])
New <- data.frame(Chromosome = Working[indx, 1],
                  Start      = Working[indx, 2],
                  Stop       = Working[indx, 3],
                  GO         = Working[indx, 4],
                  Position   = VCF[indx, 1],
                  REF        = SNP_file[indx, 2],
                  GT         = SNP_file[indx, 3],
                  Site_Conf  = SNP_file[indx, 4])

Note that the two data frames do not have the same number of rows, so only the first 350 rows of Working are compared to SNP_file, row by row. If you are comparing in a different way, you should specify that.
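
To make the row-by-row pairing concrete, here is a minimal sketch on made-up vectors. The names pos, lo and hi and their values are hypothetical and do not come from the question's data.

```r
# Hypothetical toy values, chosen only to illustrate element-wise comparison.
pos <- c(5, 12, 30)   # positions to test
lo  <- c(1, 10, 40)   # lower bound for each row
hi  <- c(8, 11, 50)   # upper bound for each row

# Element i of pos is compared only with element i of lo and hi;
# no other pairing is checked.
rowwise <- pos >= lo & pos <= hi
print(rowwise)
```

Each position is tested against the bounds on its own row only, so a position that falls inside a range on a different row is not reported.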

Data

set.seed(7)
SNP_file <- data.frame(x=sample(10), y=month.abb[1:10])
Working <- data.frame(x=sample(10), y=sample(20,10), z=(sample(LETTERS[1:10])))

Upvotes: 2

Daniel

Reputation: 328

It is not clear from your original post what you want to achieve. How do the indices i and j correspond to each other? Can you give an example of the data (input and desired output)?

The loop in your code repeats 6500 * 350 = 2275000 times, and it compares every row in SNP_file with every row in Working.
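
If comparing every row with every row really is the intent, the same all-pairs check can be sketched without an explicit loop by using outer(). The data frames snp and win, their column names and their values are assumptions made for illustration; in particular, treating Start and Stop as the range bounds is a guess about the intended columns.

```r
# Hypothetical toy data; names and values are not from the question.
snp <- data.frame(Position = c(5, 12, 30))
win <- data.frame(Start = c(1, 10, 25), Stop = c(8, 20, 35))

# inside[i, j] is TRUE when snp row i lies within the range of win row j.
inside <- outer(snp$Position, win$Start, ">=") &
          outer(snp$Position, win$Stop,  "<=")

# Row and column indices of every matching (i, j) pair.
hits <- which(inside, arr.ind = TRUE)

# Build the result in one step instead of calling rbind() inside the loop.
result <- data.frame(win[hits[, "col"], ],
                     Position = snp$Position[hits[, "row"]])
print(result)
```

For 350 by 6500 rows the logical matrix has 2,275,000 cells, which is small enough to hold in memory, and building the data frame once avoids the repeated copying that rbind() in a loop causes.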

Upvotes: 0
