Reputation: 323
I simulated a data matrix containing 200 rows x 1000 columns. It contains 0's and 1's in a binomial distribution. The probability of a 1 occurring depends on a probability matrix that I've created.
I then transpose this data matrix and convert it to a data frame. I created a function that will introduce missing data to each row of the data frame. The function will also add three columns to the data frame after the missing data is introduced. One column is the computed frequency of 1's across each of the 1000 rows. The 2nd column is the computed frequency of 0's across each row. The 3rd column is the frequency of missing values across each row.
I would like to repeat this function 500 times with the same input data frame (the one with no missing values) and output three data frames: one with 500 columns containing all of the computed frequencies of 0's (one column per simulation), one with 500 columns containing all of the computed frequencies of 1's, and one with 500 columns of the missing data frequencies.
I have seen mapply() used for something similar, but was not sure if it would work in my case. How can I repeatedly apply a function to a data frame and store the output of each computation performed within that function every time that function is repeated?
Thank you!
####Load Functions####
###Compute freq of 0's
compute.al0 = function(GEcols){
  (sum(GEcols==0, na.rm=TRUE)/sum(!is.na(GEcols)))
}
###Compute freq of 1's
compute.al1 = function(GEcols){
  (sum(GEcols==1, na.rm=TRUE)/sum(!is.na(GEcols)))
}
#Introduce missing data
addmissing = function(GEcols){
  newdata = GEcols
  num.cols = 200
  num.miss = 10
  set.to.missing = sample(num.cols, num.miss, replace=FALSE) #select num.miss to be set to missing
  newdata[set.to.missing] = NA
  return(newdata) #why is the matrix getting transposed during this??
}
#Introduce missing data and re-compute freq of 0's and 1's, and missing data freq
rep.missing = function(GEcols){
  indata = GEcols
  missdata = apply(indata,1,addmissing)
  missdata.out = as.data.frame(missdata) #have to get the df back in the right format
  missdata.out.t = t(missdata.out)
  missdata.new = as.data.frame(missdata.out.t)
  missdata.new$allele.0 = apply(missdata.new[,1:200], 1, compute.al0) #compute freq of 0's
  missdata.new$allele.1 = apply(missdata.new[,1:200], 1, compute.al1) #compute freq of 1's
  missdata.new$miss = apply(missdata.new[,1:200], 1, function(x) {(sum(is.na(x)))/200}) #compute missing
  return(missdata.new)
}
#Generate a data matrix with no missing values
datasim = matrix(0, nrow=200, ncol=1000) #pre-allocated matrix of 0's of desired size
probmatrix = col(datasim)/1000 #probability matrix, each of the 1000 columns will have a different prob
datasim2 = matrix(rbinom(200 * 1000,1,probmatrix),
                  nrow=200, ncol=1000, byrow=FALSE) #new matrix of 0's and 1's based on probabilities
#Assign column names
cnum = 1:1000
cnum = paste("M",cnum,sep='')
colnames(datasim2) = cnum
#Assign row names
rnum = 1:200
rnum = paste("L",rnum,sep='')
rownames(datasim2) = rnum
datasim2 = t(datasim2) #data will be used in the transposed form
datasim2 = as.data.frame(datasim2)
#add 10 missing values per row and compute new frequencies
datasim.miss = rep.missing(datasim2)
#Now, how can I repeat the rep.missing function
#500 times and store the output of the new frequencies
#generated from each repetition?
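On the transposition noted in the comment inside addmissing(): when the function passed to apply() over rows returns a vector with more than one element, apply() stores each row's result as a column of its output, so the result comes back transposed. A minimal sketch on a toy matrix (the names m and res are illustrative only and not part of the code above):

```r
m <- matrix(1:6, nrow = 2)          # toy input: 2 rows x 3 columns
res <- apply(m, 1, function(r) r)   # identity function applied to each row
dim(res)                            # 3 2: each row's result became a column
all(t(res) == m)                    # TRUE: t() restores the original orientation
```

This is why rep.missing() needs the t() step after the apply() call.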
Upvotes: 0
Views: 3246
Reputation: 41
I am not sure I understand which part you are stuck on. If the problem is how to store your results across repeated calls, one way would be to have a global variable and, inside your function, use <<- assignments instead of <- or =.
x=c()
func = function(i){x <<- c(x,i) }
sapply(1:5,func)
mapply is for repeating a function over multiple input lists or vectors. You want to repeat your function 500 times, so you can always do
sapply(1:500,func)
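A minimal sketch of the same repetition without a global variable, assuming a hypothetical stand-in function one_run() in place of rep.missing(). It relies on sapply() and replicate() collecting the return values themselves, so nothing has to be assigned with <<-:

```r
# Hypothetical stand-in for one simulation run; not part of the original code.
one_run <- function(n = 5) {
  x <- rbinom(n, 1, 0.5)  # n random 0/1 draws
  mean(x)                 # frequency of 1's in this run
}

set.seed(1)  # only to make the sketch reproducible

# sapply() over 1:500: the index i is ignored, it only counts repetitions
res_sap <- sapply(1:500, function(i) one_run())

# replicate() re-evaluates the expression each time and collects the results
res_rep <- replicate(500, one_run())

length(res_rep)  # 500
```

Both calls return one value per repetition; which one to use is mostly a matter of taste.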
Upvotes: 0
Reputation: 323
Update:
Frank, thank you for the replicate() suggestion. I am able to return the repetitions by changing return(missdata.new) to return(list(missdata.new)) in the rep.missing() function. I then call the function with replicate(500,rep.missing(datasim2), simplify="matrix").
This is almost exactly what I want. I would like to do return(list(missdata.new$allele.0, missdata.new$allele.1, missdata.new$miss)) in rep.missing() and return each of these 3 vectors as 3 column-bound data frames within a list. One data frame holds the 500 repetitions of missdata.new$allele.0, one holds the 500 repetitions of missdata.new$allele.1, etc.
replicate(500, rep.missing(datasim2), simplify="matrix")
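One possible way to get the three column-bound data frames is sketched below. It is only an illustration under stated assumptions: one.rep() is a hypothetical, simplified stand-in for rep.missing() that returns the three frequency vectors as a named list, and the input is a small 20 x 30 matrix with 50 repetitions rather than the full 1000 x 200 data with 500 repetitions. The idea is to keep each repetition as a list (simplify = FALSE) and then bind the same-named element across repetitions.

```r
set.seed(42)  # for reproducibility of the sketch only

# Small illustrative input (assumption: 20 rows x 30 columns, not 1000 x 200)
n.rows <- 20; n.cols <- 30; n.miss <- 3
dat <- matrix(rbinom(n.rows * n.cols, 1, 0.5), nrow = n.rows,
              dimnames = list(paste0("M", 1:n.rows), paste0("L", 1:n.cols)))

# Hypothetical simplified stand-in for rep.missing(): sets n.miss entries per
# row to NA and returns the three per-row frequency vectors as a named list
one.rep <- function(m, n.miss) {
  miss <- t(apply(m, 1, function(r) { r[sample(length(r), n.miss)] <- NA; r }))
  list(allele.0 = rowSums(miss == 0, na.rm = TRUE) / rowSums(!is.na(miss)),
       allele.1 = rowSums(miss == 1, na.rm = TRUE) / rowSums(!is.na(miss)),
       miss     = rowMeans(is.na(miss)))
}

n.reps <- 50  # 500 in the question
reps <- replicate(n.reps, one.rep(dat, n.miss), simplify = FALSE)

# For each of the three names, column-bind that element across repetitions
out <- lapply(c(allele.0 = "allele.0", allele.1 = "allele.1", miss = "miss"),
              function(nm) as.data.frame(sapply(reps, `[[`, nm)))

dim(out$allele.0)  # 20 50: one row per input row, one column per repetition
```

Here out is a list of three data frames, accessed as out$allele.0, out$allele.1 and out$miss.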
Upvotes: 1