fioghual
fioghual

Reputation: 519

Using lists for simulation

I set myself a little challenge on my way to learning R. The question was, given a sample of 500 numbers in normal distribution with mean 20, how many numbers under 20 would I get for standard deviations from 6 to 10. Just to have to learn more I decided to get 4 samples for each sd. So by the end I should have:

sd6samp1:...

sd6samp2:...

....

sd10samp4:...

My first approach, which worked was:

 ddss<-c(6:10) # sd's
 sam<-c(1:4) # 4 samples for each
 k=0  # counter in 0
 for (i in ddss) {   # for each sd
   for (j in sam) {  # for each sample
     nam <- paste("sam",i,".",j, sep="") # building a name
     n <- assign(nam,rnorm(500, 20, i))  # the great assign function
     k <- k+sum(n<=0)
   }
   print(assign(paste("ds",i,sep=""), k)) # ohh assign you're great
   k=0 # reset counter
 }

While looking for how to create variable names with the looping 'i', founded that 'assign' does the work but it also said:

Note though that if you are planning some simulations, many guRus would say that you should use a list.

So I thoght it would be good to learn lists...

In the meanwhile I also discover a great other option... ddss <- c(6:10)

for (i in ddss) {
   print(paste('prob. x<=0), with sd=',i))
   print(pnorm(0,mean=20,sd=i)*500)
}

This worked to answer the question, but the lists were still to be done... and a lot of R has yet to be learned. The main idea wasn't to know the very prob or number of negatives... but to learn R and specifically some looping.

So, I've been trying to go with the mentioned lists

My closest approach has been:

ddss<-c(6:10) # sd's to be calculated.
sam<-c(1:4) # 4 samples for each sd
liss<-list()  # initializing the list
for (i in ddss) {   # for each sd
   liss[[i]] <- list()
   for (j in sam) {  # for each sample
      liss[[i]][[j]] <- rnorm(500, 20, i)
      print(paste('ds',i,'samp',j,'=',sum(liss[[i]][[j]]<0)))
   }
}

With this one I get the information but I'm wondering about two issues (1 & 2) and some other questions (3 & 4):

  1. I get a list of 10 elements, 6 empty ones and then 4 with sublists. I can't seem to find out how to work with elements 1:4 of the list (sd's) with the 6:9 names (the very sd's).

  2. Even though I tried, I couldn't get to name the lists elements through the 'for' loops. Any insight on these issues would be great.

  3. Since in this context of simulations. What do you think is better: nested lists (lists with sublists) or simple (longer) lists?

  4. I wondered whether the 'apply' functions would be of any help here, I tried to do something, like:

vbv<-matrix(c(6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9))
lsl<-apply(vbv, 2, function(x) rnorm(500,20,x))

But it looks I'm not getting even close....

Upvotes: 3

Views: 1284

Answers (3)

Ramnath
Ramnath

Reputation: 55735

I am going to throw in another solution using the plyr package, which I think is tailor made for such exercises.

library(plyr)

# generate a data frame of parameters, repeating some as required
parameters  = data.frame(mean = 20, sd = rep(6:10, each = 4))

# generate sample data for each combination of parameters
sample_data = mdply(df, rnorm, n = 500)

# generate answer by counting number of observations less than 20
answer = data.frame(
    parameters, 
    obs_less_20 = rowSums(sample_data[,-c(1, 2),] < 20)
)

head(answer)

mean sd obs_less_20
1   20  6         247
2   20  6         250
3   20  6         242
4   20  6         259
5   20  7         240
6   20  7         237

Upvotes: 3

Gavin Simpson
Gavin Simpson

Reputation: 174908

lapply() is helpful here, where we can just apply over the set of values for the SD. It helps to write a custom wrapper around the rnorm() function so we can pass in different values for the various arguments of rnorm(), and handle the k replicates (k = 4 in your example) in a nice fashion also. That wrapper is foo() below:

foo <- function(sd, n, mean, reps = 1) {
    rands <- rnorm(n * reps, mean = mean, sd = sd)
    if(reps > 1)
        rands <- matrix(rands, ncol = reps)
    rands
}

We use it in an lapply() call like so:

sims <- lapply(6:10, FUN = foo, mean = 20, n = 500, reps = 4)

Which gives:

R> str(sims)
List of 5
 $ : num [1:500, 1:4] 30.3 22 15.6 20 19.4 ...
 $ : num [1:500, 1:4] 20.9 21.7 17.7 35 30 ...
 $ : num [1:500, 1:4] 17.88 26.48 5.19 19.25 15.59 ...
 $ : num [1:500, 1:4] 27.41 12.72 9.38 35.09 11.08 ...
 $ : num [1:500, 1:4] 16.2 11.6 20.5 35.4 27.3 ...

We can then compute the number of observations < 20 per SD

names(sims) <- paste("SD", 6:10, sep = "")
out <- lapply(sims, function(x) colSums(x < 20))

Which gives:

R> out
$SD6
[1] 218 251 253 227

$SD7
[1] 250 242 233 232

$SD8
[1] 258 241 246 274

$SD9
[1] 252 245 249 258

$SD10
[1] 253 259 241 242

@Joris suggests I show how to access elements of the list. For example, if you want the results of the simulations for a SD = 20, we could do out[[4]] because 20 was the 4th value in the vector of SDs we applied over, or, because I named the elements of the output list out, we can as for the results of the simulation using out[["SD10"]].

To Answer some of the specific points about your loops etc.,

  • to add names to a list use names(), e.g. names(mylist) <- c("foo","bar"). You'd be better off in your loop callingnames()` once per iteration of the loop to set up the names in a single shot - you probably wouldn't want to fill the names in as you go along as that would be inefficient.
  • I don't think it makes too much difference whether you use a nested list or a list containing a matrix as per my example. To alter foo() to return a list so the output of lapply() is a list of lists, we could do:

Code:

bar <- function(sd, n, mean, reps = 1) {
    rands <- rnorm(n * reps, mean = mean, sd = sd)
    if(reps > 1)
        rands <- split(rands, rep(seq_len(reps), each = n))
    rands
}
sims2 <- lapply(6:10, FUN = bar, mean = 20, n = 500, reps = 4)
names(sims2) <- paste("SD", 6:10, sep = "")
out2 <- lapply(sims2, function(x) sapply(x, function(y) sum(y < 20)))

which gives the same output as before.

Upvotes: 3

Nick Sabbe
Nick Sabbe

Reputation: 11956

The problem is in your indexes: you are running over indexer i from ddss, which runs from 6 to 10. So in the first tour of duty in your outer loop, your first statement really says: liss[[6]]<-list(), implying that the first 5 ones are NULL.

So if you insist on working with loops, this is what you should do (check ?seq_along):

ddss<-c(6:10) # sd's to be calculated.
sam<-c(1:4) # 4 samples for each sd
liss<-list()  # initializing the list
for (i in seq_along(ddss)) {   # now, i runs from 1 to 5
   liss[[i]] <- list()
   for (j in sam) {  # for each sample
      liss[[i]][[j]] <- rnorm(500, 20, i)
      print(paste('ds',ddss[i],'samp',j,'=',sum(liss[[i]][[j]]<0)))
   }
   names(liss[[i]])<-as.character(sam)#this should solve your naming issue (1/2)
}
names(liss)<-as.character(ddss)#this should solve your naming issue (2/2)

Note that, as always, it is a good idea to name your variables something more useful than i or j: if you'd named it curds, maybe you wouldn't have used it immediately as an indexer in a list?

Now, if you are really aiming for improvement (but want to stick to lists), you indeed want to go with the apply style functions:

liss<-lapply(ddss, function(curds){ #apply the inline function to each ds and store results in a list
  return(lapply(sam, function(cursam){ #apply inline function to each sam and store results in a list
    rv<-rnorm(500, 20, curds)
    cat('ds',curds,'samp',cursam,'=',sum(rv<0), "\n") #maybe better for your purposes.
    return(rv)
  }))
}) 

Finally, for your case, there is not a lot of reason to actually use lists (nor do you even need to keep the sampled data for each ds/sam): you can store everything as a threedimensional array, but since you specify it as a learning exercise (hey, maybe the array thing can be your next exercise :-)), I'll leave it at that.

Upvotes: 4

Related Questions