user2380782

Reputation: 1446

Recursively subset a data.frame

I have a data frame with close to 4 million rows. I need an efficient way to subset the data based on two criteria. I can do this in a for loop, but I was wondering if there is a more elegant, and obviously more efficient, way to do it. The data.frame looks like this:

SNP         CHR     BP          P
rs1000000   chr1    126890980   0.000007
rs10000010  chr4    21618674    0.262098    
rs10000012  chr4    1357325     0.344192
rs10000013  chr4    37225069    0.726325    
rs10000017  chr4    84778125    0.204275    
rs10000023  chr4    95733906    0.701778
rs10000029  chr4    138685624   0.260899
rs1000002   chr3    183635768   0.779574
rs10000030  chr4    103374154   0.964166    
rs10000033  chr2    139599898   0.111846    
rs10000036  chr4    139219262   0.564791
rs10000037  chr4    38924330    0.392908    
rs10000038  chr4    189176035   0.971481    
rs1000003   chr3    98342907    0.000004
rs10000041  chr3    165621955   0.573376
rs10000042  chr3    5237152     0.834206    
rs10000056  chr4    189321617   0.268479
rs1000005   chr1    34433051    0.764046
rs10000062  chr4    5254744     0.238011    
rs10000064  chr4    127809621   0.000044
rs10000068  chr2    36924287    0.000003
rs10000075  chr4    179488911   0.100225    
rs10000076  chr4    183288360   0.962476
rs1000007   chr2    237752054   0.594928
rs10000081  chr1    17348363    0.517486    
rs10000082  chr1    167310192   0.261577    
rs10000088  chr1    182605350   0.649975
rs10000092  chr4    21895517    0.000005
rs10000100  chr4    19510493    0.296693    

The first thing I need to do is select the SNPs with a P value below a threshold, then order this subset by CHR and POS (the BP column). This is the easy part, using subset and order. The next step is the tricky one. Once I have this subset, I need to fetch all the SNPs that fall within a 500,000 window upstream and downstream of each significant SNP; this step defines a region. I need to do it for all the significant SNPs and store each region in a list or something similar for further analysis. For example, in the data frame shown above, the most significant SNP (i.e. below a threshold of 0.001) for CHR==chr1 is rs1000000 and for CHR==chr4 is rs10000092. These two SNPs would therefore define two regions, and for each region I need to fetch the SNPs that fall within 500,000 up and down from the POS of the most significant SNP.

I know it is a bit complicated. Right now I am doing the tricky part by hand, but it takes a long time. Any help would be appreciated.

Upvotes: 0

Views: 397

Answers (1)

rafa.pereira

Reputation: 13827

Here is a partial solution in R using data.table, which is probably the fastest way to go in R when dealing with large datasets.

library(data.table) # v1.9.7 (devel version)


df <- fread("C:/folderpath/data.csv") # load your data (fread already returns a data.table)
setDT(df) # only needed if df is a plain data.frame, e.g. one loaded with read.csv

1st step

# Filter data under threshold 0.05 and sort by CHR and BP (the position column in your data is named BP)
df <- df[P < 0.05][order(CHR, BP)]
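As a small illustration of the filter-and-sort step, here is a sketch on a few rows copied from the question. It assumes the position column is named BP, as in the posted table, and uses the 0.001 threshold from the question's example; `snps`, `sig` and `p_threshold` are names chosen only for this illustration.

```r
library(data.table)

# A few rows copied from the question's table (illustrative subset only)
snps <- data.table(
  SNP = c("rs1000000", "rs10000081", "rs10000064", "rs10000092", "rs10000010"),
  CHR = c("chr1", "chr1", "chr4", "chr4", "chr4"),
  BP  = c(126890980, 17348363, 127809621, 21895517, 21618674),
  P   = c(0.000007, 0.517486, 0.000044, 0.000005, 0.262098)
)

p_threshold <- 0.001  # threshold taken from the question's example

# Keep the significant SNPs, then sort by chromosome and position
sig <- snps[P < p_threshold][order(CHR, BP)]
print(sig)
```

Note that CHR is a character column, so it sorts as text: with more chromosomes, "chr10" would come before "chr2". If the chromosome order matters, converting CHR to a factor with explicit levels is one option.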

2nd step

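# For each chromosome, find the position of the most significant SNP (lowest P)
# and keep the SNPs within 500,000 bp of it (df holds only SNPs that passed step 1)
df[, {lead_bp = BP[which.min(P)]
      SNP[abs(BP - lead_bp) <= 5e5]}, by = CHR]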
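The question also asks for one region per significant SNP, stored in a list and including the non-significant neighbours. One possible way to do that is sketched below, not necessarily the fastest. It assumes the position column is BP, keeps the unfiltered table so that neighbours are not lost, and uses illustrative names (`snps`, `sig`, `regions`, `window_bp`) that do not come from the original post.

```r
library(data.table)

# A few rows copied from the question's table (illustrative subset only)
snps <- data.table(
  SNP = c("rs1000000", "rs10000081", "rs10000092", "rs10000010",
          "rs10000100", "rs10000064"),
  CHR = c("chr1", "chr1", "chr4", "chr4", "chr4", "chr4"),
  BP  = c(126890980, 17348363, 21895517, 21618674, 19510493, 127809621),
  P   = c(0.000007, 0.517486, 0.000005, 0.262098, 0.296693, 0.000044)
)

p_threshold <- 0.001  # threshold taken from the question's example
window_bp   <- 5e5    # 500,000 bp on either side of each significant SNP

# Significant SNPs, sorted by chromosome and position
sig <- snps[P < p_threshold][order(CHR, BP)]

# One region per significant SNP: every SNP on the same chromosome whose
# position lies within window_bp of it, taken from the unfiltered table
regions <- lapply(seq_len(nrow(sig)), function(k) {
  chr_k <- sig$CHR[k]
  bp_k  <- sig$BP[k]
  snps[CHR == chr_k & abs(BP - bp_k) <= window_bp][order(BP)]
})
names(regions) <- sig$SNP

print(regions[["rs10000092"]])
```

On this sample the region around rs10000092 also picks up rs10000010, which lies about 277,000 bp away on chr4. Each pass scans the whole table, so with millions of rows and many significant SNPs a data.table non-equi join or foverlaps may scale better; that is a suggestion to benchmark rather than a measured result.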

Saving output in different files

df[, fwrite(copy(.SD)[, SNP := SNP], paste0("output", SNP,".csv")), by = SNP]
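If the regions are held in a named list instead, a variant of the same idea is to write one file per region rather than one per SNP. The sketch below assumes a hypothetical `regions` list keyed by lead SNP and writes to a temporary folder; the file-name pattern and `out_dir` are placeholders to adapt.

```r
library(data.table)

# Hypothetical list of regions keyed by lead SNP (stand-in for real results)
regions <- list(
  rs1000000  = data.table(SNP = "rs1000000", CHR = "chr1",
                          BP = 126890980, P = 0.000007),
  rs10000092 = data.table(SNP = c("rs10000010", "rs10000092"), CHR = "chr4",
                          BP = c(21618674, 21895517),
                          P = c(0.262098, 0.000005))
)

out_dir <- tempdir()  # assumption: replace with your own output folder

# One CSV per region, named after the lead SNP
out_files <- file.path(out_dir, paste0("region_", names(regions), ".csv"))
for (k in seq_along(regions)) {
  fwrite(regions[[k]], out_files[k])
}
```

Naming files after the lead SNP keeps each region traceable back to the SNP that defined it.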

P.S. Note that this answer uses fwrite, which at the time of writing is still only in the development version of data.table; see the data.table installation instructions to get it. You could simply use write.csv instead, but with a dataset this large speed matters, and fwrite is certainly one of the fastest alternatives.

Upvotes: 2
