R: Defining an index based on mapping file for processing in lapply

Question

Hello, This is a related question to my previous post, R: Defining a function (and/or using apply() or for loop) to perform a set of procedures repeatedly.

I would like to know how to create an index (or set up a loop) that will allow me to accomplish the same task as in the previous post, but where the annotation file is a simple 2-column map between rsIDs and Genes. I suspect the solution will be similar, but I`m having trouble modifying the solution provided by akrun to accomplish the same goal.

I am still working in R on a Windows 7 machine. sessionInfo() is pasted below my question.

I have two dataframes, SUBJ (same as before) and MAP. I would like to create a new dataframe (Output), which will be the result of repeatedly performing an operation on subsets of columns in SUBJ, with those column subsets being defined by subgrouping the MAP file.

Below, I first create the two fake dataframes, SUBJ and MAP. Next, I create the empty Output dataframe, with rownames and colnames taken from SUBJ and MAP, respectively.

Then, I perform the desired operation for the first subset of MAP. That is: 1) Process the first group from MAP, Gene1, identifying the set of rsIDs corresponding to the rows where Gene=Gene1 and saving that set to a character vector, ROWSlookup. 2) Then, for each row in SUBJ, calculate the sum of values for the subset of columns that appear in the ROWSlookup list and put the resulting sum in the Gene1 column of the Ouptut dataframe.

The actual datasets (represented by SUBJ and MAP) are very large. So I would like to create a function and/or construct apply() or for() loop statement(s), that will enable me to efficiently complete the desired Output dataframe. That is, I want the function to create a ROWSlookup for each unique Gene value in MAP, calculate a sum of the values in the corresponding columns of SUBJ and enter that sum into the corresponding cell of Output.

# CREATE FAKE SUBJ
SUBJ <- matrix(c(0,0,0,1,0,0,2,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,2,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,2,0,1,0,0,1,0,0,0,2,0,0), 10, 10)
rownames(SUBJ) <- c("subj1", "subj2", "subj3", "subj4", "subj5", "subj6", "subj7", "subj8", "subj9", "subj10")
colnames(SUBJ) <- c("rs1", "rs2", "rs3", "rs4", "rs5", "rs6", "rs7", "rs8", "rs9", "rs10") 
SUBJ<- as.data.frame(SUBJ)
SUBJ
       #rs1 rs2 rs3 rs4 rs5 rs6 rs7 rs8 rs9 rs10
#subj1    0   1   0   0   1   0   1   1   0    1
#subj2    0   0   0   0   0   0   0   1   1    0
#subj3    0   0   0   0   0   1   0   0   0    0
#subj4    1   1   2   1   1   0   1   0   0    1
#subj5    0   0   0   0   0   0   0   1   0    0
#subj6    0   0   0   0   0   0   0   0   0    0
#subj7    2   0   1   1   0   0   0   0   0    0
#subj8    0   1   0   0   0   0   0   1   0    2
#subj9    1   0   0   0   1   2   0   0   2    0
#subj10   0   0   0   0   0   0   0   0   0    0

# CREATE FAKE MAP
MAP <- cbind(c("rs1", "rs2", "rs3", "rs4", "rs5", "rs6", "rs7", "rs8","rs9", "rs10", "rs11", "rs12", "rs13", "rs14", "rs15"), c("Gene1", "Gene1", "Gene1", "Gene2", "Gene3", "Gene3", "Gene3", "Gene4","Gene4", "Gene4", "Gene5", "Gene6", "Gene7", "Gene7", "Gene7"))
colnames(MAP) <- c("rsID", "Gene") 
MAP<- as.data.frame(MAP)
MAP
   #rsID  Gene
#1   rs1 Gene1
#2   rs2 Gene1
#3   rs3 Gene1
#4   rs4 Gene2
#5   rs5 Gene3
#6   rs6 Gene3
#7   rs7 Gene3
#8   rs8 Gene4
#9   rs9 Gene4
#10 rs10 Gene4
#11 rs11 Gene5
#12 rs12 Gene6
#13 rs13 Gene7
#14 rs14 Gene7
#15 rs15 Gene7

# CREATE EMPTY OUTPUT DATAFRAME TO HOLD THE (EVENTUAL) PROCESSED VALUES
genes <- unique(MAP$Gene)

Output<-data.frame(matrix(nrow=nrow(SUBJ), ncol=length(genes)))

# SET ROWNAMES AND COLNAMES OF OUTPUT DF
row.names(Output)<- row.names(SUBJ)
colnames(Output)<- genes
Output
       #Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 Gene7
#subj1     NA    NA    NA    NA    NA    NA    NA
#subj2     NA    NA    NA    NA    NA    NA    NA
#subj3     NA    NA    NA    NA    NA    NA    NA
#subj4     NA    NA    NA    NA    NA    NA    NA
#subj5     NA    NA    NA    NA    NA    NA    NA
#subj6     NA    NA    NA    NA    NA    NA    NA
#subj7     NA    NA    NA    NA    NA    NA    NA
#subj8     NA    NA    NA    NA    NA    NA    NA
#subj9     NA    NA    NA    NA    NA    NA    NA
#subj10    NA    NA    NA    NA    NA    NA    NA

# PROCESS FIRST Gene in MAP, IDENTIFYING THE rsIDs corresponding to rows where Gene=Gene1
# SAVE THOSE rsIDs TO A VECTOR TO SERVE AS LOOKUP VALUES
ROWSlookup <- MAP[which(MAP$Gene=="Gene1"),]$rsID
#[1] rs1 rs2 rs3

# FOR EACH ROW IN SUBJ, CALCULATE THE SUM OF VALUES WITHIN THE COLs IN ROWSlookup LIST AND PUT THE RESULTING VALUES
# IN THE Gene1 COL OF THE OUTPUT DF (Output)
Output$Gene1 <- apply(SUBJ[,which(names(SUBJ) %in% ROWSlookup)],1,sum, na.rm=TRUE)
Output
       #Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 Gene7
#subj1      1    NA    NA    NA    NA    NA    NA
#subj2      0    NA    NA    NA    NA    NA    NA
#subj3      0    NA    NA    NA    NA    NA    NA
#subj4      4    NA    NA    NA    NA    NA    NA
#subj5      0    NA    NA    NA    NA    NA    NA
#subj6      0    NA    NA    NA    NA    NA    NA
#subj7      3    NA    NA    NA    NA    NA    NA
#subj8      1    NA    NA    NA    NA    NA    NA
#subj9      1    NA    NA    NA    NA    NA    NA
#subj10     0    NA    NA    NA    NA    NA    NA



#Language: R
#OS: Windows 7
sessionInfo()
#R version 3.0.3 (2014-03-06)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#
#locale:
#[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252    LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
#[5] LC_TIME=English_Canada.1252    
#
#attached base packages:
 #[1] stats4    parallel  splines   grid      stats     graphics  grDevices utils     datasets  methods   base     
#
#other attached packages:
 #[1] QuantPsyc_1.5         boot_1.3-13           perturb_2.05          RCurl_1.95-4.5        bitops_1.0-6          car_2.0-22           
 #[7] reprtree_0.6          plotrix_3.5-10        rpart.plot_1.4-5      sqldf_0.4-7.1         RSQLite.extfuns_0.0.1 RSQLite_1.0.0        
#[13] gsubfn_0.6-6          proto_0.3-10          XML_3.98-1.1          RMySQL_0.9-3          DBI_0.3.1             mlbench_2.1-1        
#[19] polycor_0.7-8         sfsmisc_1.0-26        quantregForest_0.2-3  tree_1.0-35           maptree_1.4-7         cluster_1.15.3       
#[25] mice_2.22             VIM_4.0.0             colorspace_1.2-4      randomForest_4.6-10   ROCR_1.0-5            gplots_2.15.0        
#[31] caret_6.0-37          partykit_0.8-0        biomaRt_2.18.0        NCBI2R_1.4.6          snpStats_1.12.0       betareg_3.0-5        
#[37] arm_1.7-07            lme4_1.1-7            Rcpp_0.11.3           Matrix_1.1-4          nlme_3.1-118          mvtnorm_1.0-1        
#[43] taRifx_1.0.6          sos_1.3-8             brew_1.0-6            R.utils_1.34.0        R.oo_1.18.0           R.methodsS3_1.6.1    
#[49] rattle_3.3.0          jsonlite_0.9.13       httpuv_1.3.2          httr_0.5              gmodels_2.15.4.1      ggplot2_1.0.0        
#[55] JGR_1.7-16            iplots_1.1-7          JavaGD_0.6-1          party_1.0-18          modeltools_0.2-21     strucchange_1.5-0    
#[61] sandwich_2.3-2        zoo_1.7-11            pROC_1.7.3            e1071_1.6-4           psych_1.4.8.11        gtools_3.4.1         
#[67] functional_0.6        modeest_2.1           stringi_0.3-1         languageR_1.4.1       utility_1.3           data.table_1.9.4     
#[73] xlsx_0.5.7            xlsxjars_0.6.1        rJava_0.9-6           snow_0.3-13           doParallel_1.0.8      iterators_1.0.7      
#[79] foreach_1.4.2         reshape2_1.4          reshape_0.8.5         plyr_1.8.1            xtable_1.7-4          stringr_0.6.2        
#[85] foreign_0.8-61        Hmisc_3.14-6          Formula_1.1-2         survival_2.37-7       class_7.3-11          MASS_7.3-35          
#[91] nnet_7.3-8            Revobase_7.2.0        RevoMods_7.2.0        RevoScaleR_7.2.0      lattice_0.20-27       rpart_4.1-5          
#
#loaded via a namespace (and not attached):
 #[1] abind_1.4-0         acepack_1.3-3.3     BiocGenerics_0.8.0  BradleyTerry2_1.0-5 brglm_0.5-9         caTools_1.17.1      chron_2.3-45       
 #[8] coda_0.16-1         codetools_0.2-9     coin_1.0-24         DEoptimR_1.0-2      digest_0.6.4        flexmix_2.3-12      gdata_2.13.3       
#[15] glmnet_1.9-8        gtable_0.1.2        KernSmooth_2.23-13  latticeExtra_0.6-26 lmtest_0.9-33       minqa_1.2.4         munsell_0.4.2      
#[22] nloptr_1.0.4        pkgXMLBuilder_1.0   png_0.1-7           RColorBrewer_1.0-5  revoIpe_1.0         robustbase_0.92-2   scales_0.2.4       
#[29] sp_1.0-16           tcltk_3.0.3         tools_3.0.3         vcd_1.3-2

akrun · Accepted Answer

You could try

 Output[] <- Map(function(x,y) {
        lookup <- as.character(x[x$Gene==y,]$rsID)
        indx <- na.omit(match(lookup, colnames(SUBJ)))
           if(length(indx)>0){
             rowSums(SUBJ[indx], na.rm=TRUE) 
             }
           else NA}, 
         list(MAP), genes)

Or

 MAP1 <- MAP[MAP$rsID %in% colnames(SUBJ),]
 lst <- lapply(split(as.character(MAP1$rsID), MAP1$Gene,drop=TRUE),
                      function(x) rowSums(SUBJ[x], na.rm=TRUE))
 Output[names(lst)] <- lst

R: Defining an index based on mapping file for processing in lapply

Answers (1)

Related Questions