Michelle
Michelle

Reputation: 1363

Creating a replicate data frame without using rbind, follow on from earlier simulation loop issue

Rather than adding in more comments or making my original question longer, I have created another question. I have received excellent advice in the previous question (here) but I am not good enough in R to implement the suggestions in the comments.

The original code, that took ages, was:

Male.MC <-c()
for (j in 1:100)            {
    for (i in 1:nrow(Male.Distrib))  {
        u2        <- Male.Distrib$stddev_u2[i] * rnorm(1, mean = 0, sd = 1)
        mc_bca    <- Male.Distrib$FixedEff[i] + u2
        temp      <- Lambda.Value*mc_bca+1
        ginv_a    <- temp^(1/Lambda.Value)
        d2ginv_a  <- max(0,(1-Lambda.Value)*temp^(1/Lambda.Value-2))
        mc_amount <- ginv_a + d2ginv_a * Male.Resid.Var / 2
        z <- data.frame(
        RespondentID = Male.Distrib$RespondentID[i], 
        Subgroup     = Male.Distrib$Subgroup[i], 
        mc_amount    = mc_amount,
        IndvWeight   = Male.Distrib$INDWTS[i]/100
        )
        Male.MC <- as.data.frame(rbind(Male.MC,z))
    }
}

The replicate() answer worked well when I thought I only needed one output (mc_amount) from the function:

Male.Distrib = read.table('MaleDistrib.txt', check.names=F)

getMC <- function(df, Lambda.Value=0.4, Male.Resid.Var=12.1029420429778) {
      u2        <- df$stddev_u2 * rnorm(nrow(df), mean = 0, sd = 1)
      mc_bca    <- df$FixedEff + u2
      temp      <- Lambda.Value*mc_bca+1
      ginv_a    <- temp^(1/Lambda.Value)
      d2ginv_a  <- max(0,(1-Lambda.Value)*temp^(1/Lambda.Value-2))
      mc_amount <- ginv_a + d2ginv_a * Male.Resid.Var / 2
      mc_amount
}
replicate(10, getMC(Male.Distrib))

However, even with data corrections made, I am getting unexpected results so I need to be able to see the values for all the interim calculations to determine where I have gone wrong in my logic. This is where I'm stuck. I have created a smaller data frame called tempdata for testing, which is just head() from my larger dataset of 7135 observations. The tempdata set is:

    RndmEff RespondentID Subgroup RespondentID Replicates IntakeAmt RACE INDWTS    TOTWTS   GRPWTS NUMSUBJECTS TOTSUBJECTS  FixedEff stddev_u2
1  1.343753         9966        6         9966      41067 33.449808    2  41067 120622201 41657878        1466        7135  6.089918  2.645938
2 -5.856516         9967        5         9967       2322  2.533528    3   2322 120622201 22715139        1100        7135  6.755664  2.645938
3 -3.648339         9970        4         9970      17434  9.575439    2  17434 120622201 10520535        1424        7135  7.079757  2.645938
4  2.697533         9972        6         9972      21723 43.340180    2  21723 120622201 41657878        1466        7135  6.089918  2.645938
5  3.531878         9974        3         9974        375 55.660607    3    375 120622201 10791729        1061        7135  6.176319  2.645938
6  6.627767         9976        6         9976      48889 91.480049    2  48889 120622201 41657878        1466        7135  6.089918  2.645938

The updated commands that I am using are:

getMC <- function(df, Lambda.Value=0.4, Male.Resid.Var=12.1029420429778) {
    RespondentID <- df$RespondentID
    u2        <- df$stddev_u2 * rnorm(nrow(df), mean = 0, sd = 1)
    mc_bca    <- df$FixedEff + u2
    temp      <- max(Lambda.Value*mc_bca+1,Lambda.Value*Min_bca+1)
    ginv_a    <- temp^(1/Lambda.Value)
    d2ginv_a  <- max(0,(1-Lambda.Value)*temp^(1/Lambda.Value-2))
    mc_amount <- ginv_a + d2ginv_a * Male.Resid.Var / 2
    return(list(RespondentID, temp, ginv_a, d2ginv_a, mc_amount))
}
Test <- replicate(10, getMC(tempdata))

I get a very nice layout for my calculated variables (temp, ginv_a, d2ginv_a, mc_amount) but there are two problems with the results. These problems could be related, I don't understand enough to work out what is happening.

First, I only get 10 columns relating to the first RespondentID, so the function does not seem to be applied to the 6 that are in the dataset.

Second, I get 10 columns, but the RespondentID results are concatenated into one cell in each column. If I add u2 or mc_bca to the return list, these are also similarly concatenated into one cell. I have read the R help for return and it contains this line

value could be a series of non-empty expressions separated by commas. In that case the value returned is a list of the evaluated expressions, with names set to the expressions where these are the names of R objects. but I don't understand enough about R function programming to know if that is relevant.

I'm hoping there is a quick and obvious fix to this. I have been unable to find a similar problem of which I could copy the solution, all the examples of multiple returns from functions that I have found have used variables being calculated in the function.

I have tried the alternative of creating an empty data frame and then trying to vectorize the results into that. I'm worse at vectorizing than I am at replication.

Update: missed the min_bca value, which is -2.44478269434376

Upvotes: 0

Views: 515

Answers (1)

Maiasaura
Maiasaura

Reputation: 32986

After a couple of more edits, hopefully here is the final solution to your question.

    getMC <- function(df, Lambda.Value=0.4, Male.Resid.Var=12.1029420429778,Min_bca=-2.44478269434376) {
            u2        <- df$stddev_u2 * rnorm(nrow(df), mean = 0, sd = 1)
            mc_bca    <- df$FixedEff + u2
            temp      <- pmax((Lambda.Value*mc_bca+1),(Lambda.Value*Min_bca+1))
            ginv_a    <- temp^(1/Lambda.Value)
            d2ginv_a  <- max(0,(1-Lambda.Value)*temp^(1/Lambda.Value-2))
            mc_amount <- ginv_a + d2ginv_a * Male.Resid.Var / 2
            return(data.frame(RespondentID=df$RespondentID,temp=temp, ginv_a, d2ginv_a, mc_amount))
        }

   data=rep(list(tempdata),10) # change 10 to a higher number of replicates
   result_data=llply(data,getMC, .progress = "text")

Some notes: I had to troubleshoot your function on a single replicate, line by line, to find out what was wrong (this is something you should do before posting because the question above is not about this issue). max(vector1,vector2)returns a single value which makes temp the same for all RespondentID. Instead I replaced it with pmax (see ?max for an explanation).

Upvotes: 1

Related Questions