SC2

Reputation: 323

Repeat a function on a data frame and store the output

I simulated a data matrix with 200 rows x 1000 columns. It contains 0's and 1's drawn from a binomial distribution (one trial per cell). The probability of a 1 in each cell comes from a probability matrix that I've created.

I then transpose this data matrix (giving 1000 rows x 200 columns) and convert it to a data frame. I created a function that introduces missing data into each row of the data frame. After the missing data is introduced, the function also adds three columns to the data frame. The first column is the frequency of 0's in each of the 1000 rows, computed over the non-missing values. The second column is the frequency of 1's in each row, computed the same way. The third column is the frequency of missing values in each row.

I would like to repeat this function 500 times with the same input data frame (the one with no missing values) and output three data frames: one with 500 columns containing all of the computed frequencies of 0's (one column per simulation), one with 500 columns containing all of the computed frequencies of 1's, and one with 500 columns of the missing data frequencies.

I have seen mapply() used for something similar, but was not sure if it would work in my case. How can I repeatedly apply a function to a data frame and store the output of each computation performed within that function every time that function is repeated?

Thank you!

    ####Load Functions####
    ###Compute freq of 0's
    compute.al0 = function(GEcols){
      (sum(GEcols==0, na.rm=TRUE)/sum(!is.na(GEcols))) 
    }

    ###Compute freq of 1's
    compute.al1 = function(GEcols){
      (sum(GEcols==1, na.rm=TRUE)/sum(!is.na(GEcols)))
    }

    #Introduce missing data
    addmissing = function(GEcols){
      newdata = GEcols
      num.cols = 200
      num.miss = 10
      set.to.missing = sample(num.cols, num.miss, replace=FALSE) #select num.miss to be set to missing
      newdata[set.to.missing] = NA
      return(newdata) #why is the matrix getting transposed during this??
    }

    #Introduce missing data and re-compute freq of 0's and 1's, and missing data freq
    rep.missing = function(GEcols){
      indata = GEcols
      missdata = apply(indata,1,addmissing)
      missdata.out = as.data.frame(missdata) #have to get the df back in the right format
      missdata.out.t = t(missdata.out)
      missdata.new = as.data.frame(missdata.out.t)
      missdata.new$allele.0 = apply(missdata.new[,1:200], 1, compute.al0) #compute freq of 0's
      missdata.new$allele.1 = apply(missdata.new[,1:200], 1, compute.al1) #compute freq of 1's
      missdata.new$miss = apply(missdata.new[,1:200], 1, function(x) {(sum(is.na(x)))/200}) #compute missing
      return(missdata.new)  
    }


    #Generate a data matrix with no missing values
    datasim = matrix(0, nrow=200, ncol=1000) #pre-allocated matrix of 0's of desired size
    probmatrix = col(datasim)/1000 #probability matrix, each of the 1000 columns will have a different prob
    datasim2 = matrix(rbinom(200 * 1000,1,probmatrix), 
              nrow=200, ncol=1000, byrow=FALSE) #new matrix of 0's and 1's based on probabilities

    #Assign column names
    cnum = 1:1000
    cnum = paste("M",cnum,sep='')
    colnames(datasim2) = cnum
    #Assign row names
    rnum = 1:200
    rnum = paste("L",rnum,sep='')
    rownames(datasim2) = rnum

    datasim2 = t(datasim2) #data will be used in the transposed form
    datasim2 = as.data.frame(datasim2)

    #add 10 missing values per row and compute new frequencies
    datasim.miss = rep.missing(datasim2)

    #Now, how can I repeat the rep.missing function 
    #500 times and store the output of the new frequencies 
    #generated from each repetition?

Upvotes: 0

Views: 3246

Answers (2)

shahram

Reputation: 41

I am not sure which part you are stuck on. If the problem is how to store your results across repetitions, one way is to use a global variable and, inside your function, assign to it with <<- instead of <- or =.

     x=c()
     func = function(i){x <<- c(x,i) }
     sapply(1:5,func)

mapply is for applying a function over multiple input lists or vectors in parallel.

You want to repeat your function 500 times, so you can always do sapply(1:500, func).
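The same repetition can also be written without a global variable, because sapply() already collects whatever the function returns. The sketch below assumes a hypothetical one_run() as a stand-in for the function being repeated; it is not the asker's rep.missing(). It also shows replicate(), which re-evaluates an expression a fixed number of times and so does not need the dummy index.

```r
# Hypothetical stand-in for the function to repeat; returns one number per call.
one_run <- function() mean(rbinom(200, 1, 0.5))

set.seed(1)  # only so the sketch is reproducible

# sapply() over an index: i is ignored, each call is one repetition,
# and the return values are collected into a vector.
res_sapply <- sapply(1:500, function(i) one_run())

# replicate() re-evaluates the expression n times and collects the results.
res_replicate <- replicate(500, one_run())

length(res_sapply)
length(res_replicate)
```

Returning values and letting sapply() or replicate() gather them avoids growing a vector with c() inside the function, which gets slow as the number of repetitions increases.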

Upvotes: 0

SC2

Reputation: 323

Update:

Frank, thank you for the replicate() suggestion. I am able to return the repetitions by changing return(missdata.new) to return(list(missdata.new)) in the rep.missing() function. I then call the function with replicate(500,rep.missing(datasim2), simplify="matrix").

This is almost exactly what I want. I would like to do

    return(list(missdata.new$allele.0, missdata.new$allele.1, missdata.new$miss))

in rep.missing() and get back a list of three data frames, each built by column-binding one of these vectors across the repetitions. One data frame holds the 500 repetitions of missdata.new$allele.0, one holds the 500 repetitions of missdata.new$allele.1, and one holds the 500 repetitions of missdata.new$miss.

    replicate(500, rep.missing(datasim2), simplify="matrix")
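One way this could be done is to have the function return a named list of the three vectors, keep the repetitions as a plain list with simplify = FALSE, and then column-bind each statistic across runs. The following is a minimal sketch under stated assumptions: rep.missing.freq() is a hypothetical, simplified variant of rep.missing(), and the toy data is much smaller than the 1000 x 200 data frame in the question.

```r
set.seed(42)  # only so the sketch is reproducible

# Toy input: 20 rows x 50 columns of 0/1 (the question uses 1000 x 200).
n.rows <- 20; n.cols <- 50; n.miss <- 10; n.rep <- 5
dat <- matrix(rbinom(n.rows * n.cols, 1, 0.5), nrow = n.rows,
              dimnames = list(paste0("M", 1:n.rows), paste0("L", 1:n.cols)))

# Hypothetical variant of rep.missing(): returns a named list of three vectors.
rep.missing.freq <- function(x, num.miss) {
  # Set num.miss cells per row to NA. apply() over rows returns each
  # result as a column, so t() restores the original orientation.
  miss <- t(apply(x, 1, function(r) {
    r[sample(length(r), num.miss)] <- NA
    r
  }))
  list(allele.0 = rowSums(miss == 0, na.rm = TRUE) / rowSums(!is.na(miss)),
       allele.1 = rowSums(miss == 1, na.rm = TRUE) / rowSums(!is.na(miss)),
       miss     = rowMeans(is.na(miss)))
}

# simplify = FALSE keeps one list per repetition.
runs <- replicate(n.rep, rep.missing.freq(dat, n.miss), simplify = FALSE)

# For each statistic, pull that vector out of every run and column-bind them.
stat.names <- c(allele.0 = "allele.0", allele.1 = "allele.1", miss = "miss")
freqs <- lapply(stat.names,
                function(nm) as.data.frame(sapply(runs, `[[`, nm)))

sapply(freqs, dim)  # rows x columns of each of the three data frames
```

Here freqs$allele.0, freqs$allele.1 and freqs$miss each have one row per input row and one column per repetition; with the question's data, n.rep would be 500.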

Upvotes: 1
