R: Defining a function (and/or using apply() or for loop) to perform a set of procedures repeatedly

Question

Language: R OS: Windows 7

I would like to know how to create a function and/or construct apply() or for() loop statement(s) that will allow me to accomplish the task described below.

I am working in R on a Windows 7 machine. sessionInfo() is pasted below my question.

I have two dataframes, SUBJ and ANNO. I would like to create a new dataframe (Output) from performing an operation on a subset of columns in SUBJ, with that column subset being defined by the results of an operation on ANNO.

Below, I first create the two fake dataframes, SUBJ and ANNO. Next, I create the empty Output dataframe, with rownames and colnames taken from SUBJ and ANNO, respectively.

Then, I perform the desired operation for the first column of ANNO. That is: 1) I process the first column of ANNO, ANNO1, identifying the row.names of the rows where ANNO1==1 and saving them to a character vector, ROWSlookup. 2) Then, for each row in SUBJ, I calculate the sum of the values in the columns named in ROWSlookup and put the resulting sum in the ANNO1 column of the Output dataframe.

The actual datasets (represented by SUBJ and ANNO) are very large. So I would like to create a function and/or construct apply() or for() loop statement(s), that will enable me to efficiently complete the desired Output dataframe. That is, I want the function to create a ROWSlookup for each column of ANNO, calculate a sum of the values in the corresponding columns of SUBJ and enter that sum into the corresponding cell of Output.

# CREATE FAKE SUBJ
    SUBJ <- matrix(c(0,0,0,1,0,0,2,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,2,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,2,0,1,0,0,1,0,0,0,2,0,0), 10, 10)
    rownames(SUBJ) <- c("subj1", "subj2", "subj3", "subj4", "subj5", "subj6", "subj7", "subj8", "subj9", "subj10")
    colnames(SUBJ) <- c("rs1", "rs2", "rs3", "rs4", "rs5", "rs6", "rs7", "rs8", "rs9", "rs10") 
    SUBJ<- as.data.frame(SUBJ)
SUBJ

           #rs1 rs2 rs3 rs4 rs5 rs6 rs7 rs8 rs9 rs10
    #subj1    0   1   0   0   1   0   1   1   0    1
    #subj2    0   0   0   0   0   0   0   1   1    0
    #subj3    0   0   0   0   0   1   0   0   0    0
    #subj4    1   1   2   1   1   0   1   0   0    1
    #subj5    0   0   0   0   0   0   0   1   0    0
    #subj6    0   0   0   0   0   0   0   0   0    0
    #subj7    2   0   1   1   0   0   0   0   0    0
    #subj8    0   1   0   0   0   0   0   1   0    2
    #subj9    1   0   0   0   1   2   0   0   2    0
    #subj10   0   0   0   0   0   0   0   0   0    0

# CREATE FAKE ANNO
    ANNO <- matrix(c(0,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,1,0,1,0), 
        8, 8)
    length(c(0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0))
    rownames(ANNO) <- c("rs1", "rs2", "rs3", "rs4", "rs5", "rs6", "rs7", "rs8")
    colnames(ANNO) <- c("ANNO1","ANNO2","ANNO3","ANNO4","ANNO5","ANNO6","ANNO7","ANNO8") 
    ANNO<- as.data.frame(ANNO)

ANNO

        #ANNO1 ANNO2 ANNO3 ANNO4 ANNO5 ANNO6 ANNO7 ANNO8
    #rs1     0     0     0     0     0     1     1     1
    #rs2     0     0     0     0     1     0     1     0
    #rs3     0     1     0     1     0     0     0     1
    #rs4     1     0     1     0     0     1     0     0
    #rs5     1     0     0     0     0     0     0     1
    #rs6     0     1     0     0     0     0     0     0
    #rs7     0     0     0     0     1     0     0     1
    #rs8     0     0     0     0     0     0     0     0

# CREATE EMPTY OUTPUT DATAFRAME TO HOLD THE (EVENTUAL) PROCESSED VALUES
    Output<-data.frame(matrix(nrow=nrow(SUBJ), ncol=ncol(ANNO)))

# SET ROWNAMES AND COLNAMES OF OUTPUT DF
    row.names(Output)<- row.names(SUBJ)
    colnames(Output)<- colnames(ANNO)
    Output
           #ANNO1 ANNO2 ANNO3 ANNO4 ANNO5 ANNO6 ANNO7 ANNO8
    #subj1     NA    NA    NA    NA    NA    NA    NA    NA
    #subj2     NA    NA    NA    NA    NA    NA    NA    NA
    #subj3     NA    NA    NA    NA    NA    NA    NA    NA
    #subj4     NA    NA    NA    NA    NA    NA    NA    NA
    #subj5     NA    NA    NA    NA    NA    NA    NA    NA
    #subj6     NA    NA    NA    NA    NA    NA    NA    NA
    #subj7     NA    NA    NA    NA    NA    NA    NA    NA
    #subj8     NA    NA    NA    NA    NA    NA    NA    NA
    #subj9     NA    NA    NA    NA    NA    NA    NA    NA
    #subj10    NA    NA    NA    NA    NA    NA    NA    NA

# PROCESS FIRST COLUMN OF ANNO, ANNO1, IDENTIFYING THE row.names corresponding to rows where ANNO1==1
# SAVE THOSE row.names TO A VECTOR TO SERVE AS LOOKUP VALUES

    ROWSlookup <- row.names(ANNO[which(ANNO$ANNO1==1),])

#[1] "rs4" "rs5"

# FOR EACH ROW IN SUBJ, CALCULATE THE SUM OF VALUES WITHIN THE COLs IN ROWSlookup AND PUT THE RESULTING VALUES
# IN THE ANNO1 COL OF THE OUTPUT DF

    Output$ANNO1 <- apply(SUBJ[,which(names(SUBJ) %in% ROWSlookup)],1,sum,na.rm=TRUE)
    Output
           #ANNO1 ANNO2 ANNO3 ANNO4 ANNO5 ANNO6 ANNO7 ANNO8
    #subj1      1    NA    NA    NA    NA    NA    NA    NA
    #subj2      0    NA    NA    NA    NA    NA    NA    NA
    #subj3      0    NA    NA    NA    NA    NA    NA    NA
    #subj4      2    NA    NA    NA    NA    NA    NA    NA
    #subj5      0    NA    NA    NA    NA    NA    NA    NA
    #subj6      0    NA    NA    NA    NA    NA    NA    NA
    #subj7      1    NA    NA    NA    NA    NA    NA    NA
    #subj8      0    NA    NA    NA    NA    NA    NA    NA
    #subj9      1    NA    NA    NA    NA    NA    NA    NA
    #subj10     0    NA    NA    NA    NA    NA    NA    NA



sessionInfo()

#R version 3.0.3 (2014-03-06)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#
#locale:
#[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252    LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
#[5] LC_TIME=English_Canada.1252    
#
#attached base packages:
 #[1] stats4    parallel  splines   grid      stats     graphics  grDevices utils     datasets  methods   base     
#
#other attached packages:
 #[1] QuantPsyc_1.5         boot_1.3-13           perturb_2.05          RCurl_1.95-4.5        bitops_1.0-6          car_2.0-22           
 #[7] reprtree_0.6          plotrix_3.5-10        rpart.plot_1.4-5      sqldf_0.4-7.1         RSQLite.extfuns_0.0.1 RSQLite_1.0.0        
#[13] gsubfn_0.6-6          proto_0.3-10          XML_3.98-1.1          RMySQL_0.9-3          DBI_0.3.1             mlbench_2.1-1        
#[19] polycor_0.7-8         sfsmisc_1.0-26        quantregForest_0.2-3  tree_1.0-35           maptree_1.4-7         cluster_1.15.3       
#[25] mice_2.22             VIM_4.0.0             colorspace_1.2-4      randomForest_4.6-10   ROCR_1.0-5            gplots_2.15.0        
#[31] caret_6.0-37          partykit_0.8-0        biomaRt_2.18.0        NCBI2R_1.4.6          snpStats_1.12.0       betareg_3.0-5        
#[37] arm_1.7-07            lme4_1.1-7            Rcpp_0.11.3           Matrix_1.1-4          nlme_3.1-118          mvtnorm_1.0-1        
#[43] taRifx_1.0.6          sos_1.3-8             brew_1.0-6            R.utils_1.34.0        R.oo_1.18.0           R.methodsS3_1.6.1    
#[49] rattle_3.3.0          jsonlite_0.9.13       httpuv_1.3.2          httr_0.5              gmodels_2.15.4.1      ggplot2_1.0.0        
#[55] JGR_1.7-16            iplots_1.1-7          JavaGD_0.6-1          party_1.0-18          modeltools_0.2-21     strucchange_1.5-0    
#[61] sandwich_2.3-2        zoo_1.7-11            pROC_1.7.3            e1071_1.6-4           psych_1.4.8.11        gtools_3.4.1         
#[67] functional_0.6        modeest_2.1           stringi_0.3-1         languageR_1.4.1       utility_1.3           data.table_1.9.4     
#[73] xlsx_0.5.7            xlsxjars_0.6.1        rJava_0.9-6           snow_0.3-13           doParallel_1.0.8      iterators_1.0.7      
#[79] foreach_1.4.2         reshape2_1.4          reshape_0.8.5         plyr_1.8.1            xtable_1.7-4          stringr_0.6.2        
#[85] foreign_0.8-61        Hmisc_3.14-6          Formula_1.1-2         survival_2.37-7       class_7.3-11          MASS_7.3-35          
#[91] nnet_7.3-8            Revobase_7.2.0        RevoMods_7.2.0        RevoScaleR_7.2.0      lattice_0.20-27       rpart_4.1-5          
#
#loaded via a namespace (and not attached):
 #[1] abind_1.4-0         acepack_1.3-3.3     BiocGenerics_0.8.0  BradleyTerry2_1.0-5 brglm_0.5-9         caTools_1.17.1      chron_2.3-45       
 #[8] coda_0.16-1         codetools_0.2-9     coin_1.0-24         DEoptimR_1.0-2      digest_0.6.4        flexmix_2.3-12      gdata_2.13.3       
#[15] glmnet_1.9-8        gtable_0.1.2        KernSmooth_2.23-13  latticeExtra_0.6-26 lmtest_0.9-33       minqa_1.2.4         munsell_0.4.2      
#[22] nloptr_1.0.4        pkgXMLBuilder_1.0   png_0.1-7           RColorBrewer_1.0-5  revoIpe_1.0         robustbase_0.92-2   scales_0.2.4       
#[29] sp_1.0-16           tcltk_3.0.3         tools_3.0.3         vcd_1.3-2

akrun · Accepted Answer

First, build a row/column index of the cells where ANNO==1 by calling which with the argument arr.ind=TRUE. The resulting matrix, indx, carries the same row names as the ANNO dataset. Splitting those row names by the second column of indx (the column index) gives a list with one character vector of row names per ANNO column. Because the row names of ANNO are also column names of SUBJ, each vector can be used to subset SUBJ by column. For example, SUBJ[c('rs1','rs2')] returns only those two columns of SUBJ. In the same way, SUBJ[x] (where x is one element of the split row names) returns the columns flagged for that annotation. Then, use rowSums on each subset.

indx <- which(ANNO==1,arr.ind=TRUE)
Output[] <- lapply(split(row.names(indx), indx[,2]), 
               function(x) rowSums(SUBJ[x], na.rm=TRUE))
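
To make the intermediate objects easier to picture, here is a small sketch on toy data. The names anno_toy, subj_toy, indx_toy, groups and res are made up for this illustration and are not part of the question's data; the steps are the same as in the code above.

```r
# Toy data (hypothetical names): 3 markers x 2 annotations, 2 subjects x 3 markers
anno_toy <- data.frame(A1 = c(1, 0, 1), A2 = c(0, 1, 1),
                       row.names = c("rs1", "rs2", "rs3"))
subj_toy <- data.frame(rs1 = c(1, 0), rs2 = c(2, 1), rs3 = c(0, 3),
                       row.names = c("s1", "s2"))

# Row/column positions of every cell equal to 1;
# the row names of the index matrix are the marker names
indx_toy <- which(anno_toy == 1, arr.ind = TRUE)

# One character vector of marker names per annotation column
groups <- split(row.names(indx_toy), indx_toy[, 2])
# groups[["1"]] is c("rs1", "rs3"); groups[["2"]] is c("rs2", "rs3")

# Per-subject sums over each group's columns
res <- lapply(groups, function(x) rowSums(subj_toy[x], na.rm = TRUE))
# res[["1"]] is s1 = 1, s2 = 3; res[["2"]] is s1 = 2, s2 = 4
```

Note that split only creates groups for column indices that actually occur in indx, so this approach assumes every ANNO column contains at least one 1 (which holds for the example data in the question).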

Instead of using lapply, we can also use Map. The idea is similar: x will be the whole SUBJ dataset (wrapped in a list so that it is recycled), and y will be each element of the split row names in turn.

Output[] <- Map(function(x,y) rowSums(x[y], na.rm=TRUE), 
                 list(SUBJ),split(row.names(indx), indx[,2]))
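
Since the question asks for a reusable function, the same idea can be wrapped up as follows. This is only a sketch under stated assumptions: the name sum_by_annotation and its arguments are hypothetical, it loops over the ANNO columns directly instead of using which/split, and it returns zeros for an annotation column that flags no markers. Markers that are flagged in anno but absent from subj are ignored, which mirrors the %in% lookup in the question.

```r
# Hypothetical helper: for each annotation column, sum the flagged subj columns per row
sum_by_annotation <- function(subj, anno) {
  res <- lapply(anno, function(flag) {
    # marker names flagged with 1 that also exist as columns of subj
    cols <- intersect(row.names(anno)[which(flag == 1)], names(subj))
    # assumption: an annotation with no flagged markers contributes a sum of 0
    if (length(cols) == 0L) return(numeric(nrow(subj)))
    rowSums(subj[cols], na.rm = TRUE)
  })
  data.frame(res, row.names = row.names(subj), check.names = FALSE)
}

# Toy usage (made-up data); A2 deliberately flags nothing
anno_t <- data.frame(A1 = c(1, 0, 1), A2 = c(0, 0, 0),
                     row.names = c("rs1", "rs2", "rs3"))
subj_t <- data.frame(rs1 = c(1, 0), rs2 = c(2, 1), rs3 = c(0, 3),
                     row.names = c("s1", "s2"))
out_t <- sum_by_annotation(subj_t, anno_t)
```

With the question's data, the call would be Output <- sum_by_annotation(SUBJ, ANNO), with no need to pre-allocate Output.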

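A data.frame is also a list, but one whose elements all have the same length. Output has one row per row of SUBJ and one column per column of ANNO, so assigning the list to Output[] fills the existing columns, and the result remains a data.frame with the structure of Output (row names and column names) intact.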
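
A minimal sketch of the difference between assigning with and without the empty brackets, using made-up objects (out_toy, vals, keep and lost are hypothetical names):

```r
# Toy frame pre-filled with NA, like Output in the question
out_toy <- data.frame(A1 = c(NA, NA), A2 = c(NA, NA),
                      row.names = c("s1", "s2"))
vals <- list(c(1, 3), c(2, 4))

keep <- out_toy
keep[] <- vals   # fills the columns in place; names and row names are kept
class(keep)      # "data.frame"

lost <- out_toy
lost <- vals     # plain assignment replaces the whole object with the list
class(lost)      # "list"
```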
