Reputation: 311
R: Defining a function (and/or using apply() or for loop) to perform a set of procedures repeatedly
Language: R OS: Windows 7
I would like to know how to create a function and/or construct apply() or for() loop statement(s) that will allow me to accomplish the task described below.
I am working in R on a Windows 7 machine. sessionInfo() is pasted below my question.
I have two dataframes, SUBJ and ANNO. I would like to create a new dataframe (Output) from performing an operation on a subset of columns in SUBJ, with that column subset being defined by the results of an operation on ANNO.
Below, I first create the two fake dataframes, SUBJ and ANNO. Next, I create the empty Output dataframe, with rownames and colnames taken from SUBJ and ANNO, respectively.
Then, I perform the desired operation for the first column of ANNO. That is: 1) I process the first column of ANNO, ANNO1, identifying the row.names of the rows where ANNO1==1 and saving them to a character vector, ROWSlookup. 2) Then, for each row in SUBJ, I calculate the sum of the values in the columns named in ROWSlookup and put the resulting sum in the ANNO1 column of the Output dataframe.
The actual datasets (represented by SUBJ and ANNO) are very large. So I would like to create a function and/or construct apply() or for() loop statement(s), that will enable me to efficiently complete the desired Output dataframe. That is, I want the function to create a ROWSlookup for each column of ANNO, calculate a sum of the values in the corresponding columns of SUBJ and enter that sum into the corresponding cell of Output.
# CREATE FAKE SUBJ
SUBJ <- matrix(c(0,0,0,1,0,0,2,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,2,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,2,0,1,0,0,1,0,0,0,2,0,0), 10, 10)
rownames(SUBJ) <- c("subj1", "subj2", "subj3", "subj4", "subj5", "subj6", "subj7", "subj8", "subj9", "subj10")
colnames(SUBJ) <- c("rs1", "rs2", "rs3", "rs4", "rs5", "rs6", "rs7", "rs8", "rs9", "rs10")
SUBJ<- as.data.frame(SUBJ)
SUBJ
#rs1 rs2 rs3 rs4 rs5 rs6 rs7 rs8 rs9 rs10
#subj1 0 1 0 0 1 0 1 1 0 1
#subj2 0 0 0 0 0 0 0 1 1 0
#subj3 0 0 0 0 0 1 0 0 0 0
#subj4 1 1 2 1 1 0 1 0 0 1
#subj5 0 0 0 0 0 0 0 1 0 0
#subj6 0 0 0 0 0 0 0 0 0 0
#subj7 2 0 1 1 0 0 0 0 0 0
#subj8 0 1 0 0 0 0 0 1 0 2
#subj9 1 0 0 0 1 2 0 0 2 0
#subj10 0 0 0 0 0 0 0 0 0 0
# CREATE FAKE ANNO
ANNO <- matrix(c(0,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,1,0,1,0),
8, 8)
length(c(0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0))
rownames(ANNO) <- c("rs1", "rs2", "rs3", "rs4", "rs5", "rs6", "rs7", "rs8")
colnames(ANNO) <- c("ANNO1","ANNO2","ANNO3","ANNO4","ANNO5","ANNO6","ANNO7","ANNO8")
ANNO<- as.data.frame(ANNO)
ANNO
#ANNO1 ANNO2 ANNO3 ANNO4 ANNO5 ANNO6 ANNO7 ANNO8
#rs1 0 0 0 0 0 1 1 1
#rs2 0 0 0 0 1 0 1 0
#rs3 0 1 0 1 0 0 0 1
#rs4 1 0 1 0 0 1 0 0
#rs5 1 0 0 0 0 0 0 1
#rs6 0 1 0 0 0 0 0 0
#rs7 0 0 0 0 1 0 0 1
#rs8 0 0 0 0 0 0 0 0
# CREATE EMPTY OUTPUT DATAFRAME TO HOLD THE (EVENTUAL) PROCESSED VALUES
Output<-data.frame(matrix(nrow=nrow(SUBJ), ncol=ncol(ANNO)))
# SET ROWNAMES AND COLNAMES OF OUTPUT DF
row.names(Output)<- row.names(SUBJ)
colnames(Output)<- colnames(ANNO)
Output
#ANNO1 ANNO2 ANNO3 ANNO4 ANNO5 ANNO6 ANNO7 ANNO8
#subj1 NA NA NA NA NA NA NA NA
#subj2 NA NA NA NA NA NA NA NA
#subj3 NA NA NA NA NA NA NA NA
#subj4 NA NA NA NA NA NA NA NA
#subj5 NA NA NA NA NA NA NA NA
#subj6 NA NA NA NA NA NA NA NA
#subj7 NA NA NA NA NA NA NA NA
#subj8 NA NA NA NA NA NA NA NA
#subj9 NA NA NA NA NA NA NA NA
#subj10 NA NA NA NA NA NA NA NA
# PROCESS FIRST COLUMN OF ANNO, ANNO1, IDENTIFYING THE row.names corresponding to rows where ANNO1==1
# SAVE THOSE row.names TO A VECTOR TO SERVE AS LOOKUP VALUES
ROWSlookup <- row.names(ANNO[which(ANNO$ANNO1==1),])
#[1] "rs4" "rs5"
# FOR EACH ROW IN SUBJ, CALCULATE THE SUM OF VALUES WITHIN THE COLs NAMED IN ROWSlookup AND PUT THE RESULTING VALUES
# IN THE ANNO1 COL OF THE OUTPUT DF
Output$ANNO1 <- apply(SUBJ[,which(names(SUBJ) %in% ROWSlookup)],1,sum,na.rm=TRUE)
Output
#ANNO1 ANNO2 ANNO3 ANNO4 ANNO5 ANNO6 ANNO7 ANNO8
#subj1 1 NA NA NA NA NA NA NA
#subj2 0 NA NA NA NA NA NA NA
#subj3 0 NA NA NA NA NA NA NA
#subj4 2 NA NA NA NA NA NA NA
#subj5 0 NA NA NA NA NA NA NA
#subj6 0 NA NA NA NA NA NA NA
#subj7 1 NA NA NA NA NA NA NA
#subj8 0 NA NA NA NA NA NA NA
#subj9 1 NA NA NA NA NA NA NA
#subj10 0 NA NA NA NA NA NA NA
sessionInfo()
#R version 3.0.3 (2014-03-06)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#
#locale:
#[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
#[5] LC_TIME=English_Canada.1252
#
#attached base packages:
#[1] stats4 parallel splines grid stats graphics grDevices utils datasets methods base
#
#other attached packages:
#[1] QuantPsyc_1.5 boot_1.3-13 perturb_2.05 RCurl_1.95-4.5 bitops_1.0-6 car_2.0-22
#[7] reprtree_0.6 plotrix_3.5-10 rpart.plot_1.4-5 sqldf_0.4-7.1 RSQLite.extfuns_0.0.1 RSQLite_1.0.0
#[13] gsubfn_0.6-6 proto_0.3-10 XML_3.98-1.1 RMySQL_0.9-3 DBI_0.3.1 mlbench_2.1-1
#[19] polycor_0.7-8 sfsmisc_1.0-26 quantregForest_0.2-3 tree_1.0-35 maptree_1.4-7 cluster_1.15.3
#[25] mice_2.22 VIM_4.0.0 colorspace_1.2-4 randomForest_4.6-10 ROCR_1.0-5 gplots_2.15.0
#[31] caret_6.0-37 partykit_0.8-0 biomaRt_2.18.0 NCBI2R_1.4.6 snpStats_1.12.0 betareg_3.0-5
#[37] arm_1.7-07 lme4_1.1-7 Rcpp_0.11.3 Matrix_1.1-4 nlme_3.1-118 mvtnorm_1.0-1
#[43] taRifx_1.0.6 sos_1.3-8 brew_1.0-6 R.utils_1.34.0 R.oo_1.18.0 R.methodsS3_1.6.1
#[49] rattle_3.3.0 jsonlite_0.9.13 httpuv_1.3.2 httr_0.5 gmodels_2.15.4.1 ggplot2_1.0.0
#[55] JGR_1.7-16 iplots_1.1-7 JavaGD_0.6-1 party_1.0-18 modeltools_0.2-21 strucchange_1.5-0
#[61] sandwich_2.3-2 zoo_1.7-11 pROC_1.7.3 e1071_1.6-4 psych_1.4.8.11 gtools_3.4.1
#[67] functional_0.6 modeest_2.1 stringi_0.3-1 languageR_1.4.1 utility_1.3 data.table_1.9.4
#[73] xlsx_0.5.7 xlsxjars_0.6.1 rJava_0.9-6 snow_0.3-13 doParallel_1.0.8 iterators_1.0.7
#[79] foreach_1.4.2 reshape2_1.4 reshape_0.8.5 plyr_1.8.1 xtable_1.7-4 stringr_0.6.2
#[85] foreign_0.8-61 Hmisc_3.14-6 Formula_1.1-2 survival_2.37-7 class_7.3-11 MASS_7.3-35
#[91] nnet_7.3-8 Revobase_7.2.0 RevoMods_7.2.0 RevoScaleR_7.2.0 lattice_0.20-27 rpart_4.1-5
#
#loaded via a namespace (and not attached):
#[1] abind_1.4-0 acepack_1.3-3.3 BiocGenerics_0.8.0 BradleyTerry2_1.0-5 brglm_0.5-9 caTools_1.17.1 chron_2.3-45
#[8] coda_0.16-1 codetools_0.2-9 coin_1.0-24 DEoptimR_1.0-2 digest_0.6.4 flexmix_2.3-12 gdata_2.13.3
#[15] glmnet_1.9-8 gtable_0.1.2 KernSmooth_2.23-13 latticeExtra_0.6-26 lmtest_0.9-33 minqa_1.2.4 munsell_0.4.2
#[22] nloptr_1.0.4 pkgXMLBuilder_1.0 png_0.1-7 RColorBrewer_1.0-5 revoIpe_1.0 robustbase_0.92-2 scales_0.2.4
#[29] sp_1.0-16 tcltk_3.0.3 tools_3.0.3 vcd_1.3-2
Upvotes: 1
Views: 125
Reputation: 886938
Here, we can first create a numeric row/column index from the comparison ANNO==1 using which() with the argument arr.ind=TRUE. The resulting matrix, indx, also carries the row names of ANNO. Splitting the row names of indx by its second column (the column index) gives a list with one character vector of row names per ANNO column. Because the row names of ANNO are also column names of SUBJ, each vector can be used to subset the columns of SUBJ. For example, SUBJ[c('rs1','rs2')] returns only those two columns of SUBJ; in the same way, SUBJ[x] (where x is one element of the split list) returns the columns flagged in the corresponding ANNO column. Finally, rowSums() is applied to each subset.
indx <- which(ANNO==1,arr.ind=TRUE)
Output[] <- lapply(split(row.names(indx), indx[,2]),
function(x) rowSums(SUBJ[x], na.rm=TRUE))
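To make the intermediate objects concrete, here is a minimal sketch on a toy table. The names anno_toy, indx_toy and groups are made up for illustration and are not part of the question's data.

```r
# Hypothetical toy annotation table: 3 markers (rows) x 2 annotations (columns)
anno_toy <- data.frame(A1 = c(0, 1, 1), A2 = c(1, 0, 0),
                       row.names = c("rs1", "rs2", "rs3"))

# Matrix of (row, col) positions for every cell equal to 1;
# its row names are the matching row names of anno_toy
indx_toy <- which(anno_toy == 1, arr.ind = TRUE)

# One character vector of marker names per annotation column,
# named by the column index ("1", "2", ...)
groups <- split(row.names(indx_toy), indx_toy[, 2])

print(groups)
```

Each element of groups is a set of column names, which is what gets passed to the column subsetting inside the lapply() call.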
Alternatively, instead of lapply(), we can use Map(). The idea is the same: x is the whole SUBJ dataset (wrapped in list() so that it is reused for every call), and each element of y is one vector of split row names.
Output[] <- Map(function(x,y) rowSums(x[y], na.rm=TRUE),
list(SUBJ),split(row.names(indx), indx[,2]))
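One caveat that may be worth checking on the real data: split() only creates a group for column indices that actually occur in indx, so an annotation column containing no 1s would yield no list element, and the result would no longer line up with the columns of Output. A sketch of an explicit for() loop over the annotation columns avoids that. The toy data and the names subj_toy, anno_toy, subj_mat and out_toy are hypothetical, and the sketch assumes every row name of the annotation table is also a column name of the subject table.

```r
# Hypothetical toy data; A2 deliberately flags no markers
subj_toy <- data.frame(rs1 = c(0, 2), rs2 = c(1, NA), rs3 = c(1, 1),
                       row.names = c("subj1", "subj2"))
anno_toy <- data.frame(A1 = c(0, 1, 1), A2 = c(0, 0, 0),
                       row.names = c("rs1", "rs2", "rs3"))

# Work on a matrix so that selecting zero columns still leaves a matrix
subj_mat <- as.matrix(subj_toy)

# Preallocate the result with the desired row and column names
out_toy <- matrix(0, nrow(subj_mat), ncol(anno_toy),
                  dimnames = list(rownames(subj_mat), colnames(anno_toy)))

for (j in colnames(anno_toy)) {
  # Markers flagged with 1 in annotation column j (may be empty)
  keep <- rownames(anno_toy)[which(anno_toy[[j]] == 1)]
  # Per-subject sum over the flagged markers, ignoring NA
  out_toy[, j] <- rowSums(subj_mat[, keep, drop = FALSE], na.rm = TRUE)
}

print(out_toy)
```

An annotation column with no flagged markers simply produces a column of zeros here.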
A data.frame is also a list, but one whose elements all have the same length. So, by assigning into Output[] (which has the same number of rows as SUBJ and one column per ANNO column), the result stays a data.frame and the structure of Output is kept intact.
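Since the real tables are described as very large, another option that may be worth trying is to express the same sums as a single matrix product. This is a different technique from the split()/rowSums() approach, shown here only as a sketch on hypothetical toy data (subj_toy, anno_toy, shared, ind and out_toy are illustrative names). It assumes there are no NA values, because a matrix product propagates NA rather than skipping it the way rowSums(na.rm=TRUE) does.

```r
# Hypothetical toy data; rs4 exists only in the subject table
subj_toy <- data.frame(rs1 = c(0, 2), rs2 = c(1, 0), rs3 = c(1, 1), rs4 = c(5, 5),
                       row.names = c("subj1", "subj2"))
anno_toy <- data.frame(A1 = c(0, 1, 1), A2 = c(1, 0, 0), A3 = c(0, 0, 0),
                       row.names = c("rs1", "rs2", "rs3"))

# Markers present in both tables, in a common order
shared <- intersect(colnames(subj_toy), rownames(anno_toy))

# 0/1 indicator matrix: markers (rows) x annotations (columns)
ind <- (as.matrix(anno_toy[shared, , drop = FALSE]) == 1) * 1

# subjects x markers times markers x annotations gives subjects x annotations
out_toy <- as.matrix(subj_toy[, shared, drop = FALSE]) %*% ind

print(out_toy)
```

The result is a numeric matrix; as.data.frame(out_toy) converts it if a data.frame is needed.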
Upvotes: 1