Reputation: 519
I set myself a little challenge on my way to learning R. The question was, given a sample of 500 numbers in normal distribution with mean 20, how many numbers under 20 would I get for standard deviations from 6 to 10. Just to have to learn more I decided to get 4 samples for each sd. So by the end I should have:
sd6samp1:...
sd6samp2:...
....
sd10samp4:...
My first approach, which worked was:
ddss<-c(6:10) # sd's
sam<-c(1:4) # 4 samples for each
k=0 # counter in 0
for (i in ddss) { # for each sd
for (j in sam) { # for each sample
nam <- paste("sam",i,".",j, sep="") # building a name
n <- assign(nam,rnorm(500, 20, i)) # the great assign function
k <- k+sum(n<=0)
}
print(assign(paste("ds",i,sep=""), k)) # ohh assign you're great
k=0 # reset counter
}
While looking for how to create variable names with the looping 'i', founded that 'assign' does the work but it also said:
Note though that if you are planning some simulations, many guRus would say that you should use a list.
So I thoght it would be good to learn lists...
In the meanwhile I also discover a great other option... ddss <- c(6:10)
for (i in ddss) {
print(paste('prob. x<=0), with sd=',i))
print(pnorm(0,mean=20,sd=i)*500)
}
This worked to answer the question, but the lists were still to be done... and a lot of R has yet to be learned. The main idea wasn't to know the very prob or number of negatives... but to learn R and specifically some looping.
So, I've been trying to go with the mentioned lists
My closest approach has been:
ddss<-c(6:10) # sd's to be calculated.
sam<-c(1:4) # 4 samples for each sd
liss<-list() # initializing the list
for (i in ddss) { # for each sd
liss[[i]] <- list()
for (j in sam) { # for each sample
liss[[i]][[j]] <- rnorm(500, 20, i)
print(paste('ds',i,'samp',j,'=',sum(liss[[i]][[j]]<0)))
}
}
With this one I get the information but I'm wondering about two issues (1 & 2) and some other questions (3 & 4):
I get a list of 10 elements, 6 empty ones and then 4 with sublists. I can't seem to find out how to work with elements 1:4 of the list (sd's) with the 6:9 names (the very sd's).
Even though I tried, I couldn't get to name the lists elements through the 'for' loops. Any insight on these issues would be great.
Since in this context of simulations. What do you think is better: nested lists (lists with sublists) or simple (longer) lists?
I wondered whether the 'apply' functions would be of any help here, I tried to do something, like:
vbv<-matrix(c(6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9))
lsl<-apply(vbv, 2, function(x) rnorm(500,20,x))
But it looks I'm not getting even close....
Upvotes: 3
Views: 1284
Reputation: 55735
I am going to throw in another solution using the plyr
package, which I think is tailor made for such exercises.
library(plyr)
# generate a data frame of parameters, repeating some as required
parameters = data.frame(mean = 20, sd = rep(6:10, each = 4))
# generate sample data for each combination of parameters
sample_data = mdply(df, rnorm, n = 500)
# generate answer by counting number of observations less than 20
answer = data.frame(
parameters,
obs_less_20 = rowSums(sample_data[,-c(1, 2),] < 20)
)
head(answer)
mean sd obs_less_20
1 20 6 247
2 20 6 250
3 20 6 242
4 20 6 259
5 20 7 240
6 20 7 237
Upvotes: 3
Reputation: 174908
lapply()
is helpful here, where we can just apply over the set of values for the SD. It helps to write a custom wrapper around the rnorm()
function so we can pass in different values for the various arguments of rnorm()
, and handle the k replicates (k = 4 in your example) in a nice fashion also. That wrapper is foo()
below:
foo <- function(sd, n, mean, reps = 1) {
rands <- rnorm(n * reps, mean = mean, sd = sd)
if(reps > 1)
rands <- matrix(rands, ncol = reps)
rands
}
We use it in an lapply()
call like so:
sims <- lapply(6:10, FUN = foo, mean = 20, n = 500, reps = 4)
Which gives:
R> str(sims)
List of 5
$ : num [1:500, 1:4] 30.3 22 15.6 20 19.4 ...
$ : num [1:500, 1:4] 20.9 21.7 17.7 35 30 ...
$ : num [1:500, 1:4] 17.88 26.48 5.19 19.25 15.59 ...
$ : num [1:500, 1:4] 27.41 12.72 9.38 35.09 11.08 ...
$ : num [1:500, 1:4] 16.2 11.6 20.5 35.4 27.3 ...
We can then compute the number of observations < 20 per SD
names(sims) <- paste("SD", 6:10, sep = "")
out <- lapply(sims, function(x) colSums(x < 20))
Which gives:
R> out
$SD6
[1] 218 251 253 227
$SD7
[1] 250 242 233 232
$SD8
[1] 258 241 246 274
$SD9
[1] 252 245 249 258
$SD10
[1] 253 259 241 242
@Joris suggests I show how to access elements of the list. For example, if you want the results of the simulations for a SD = 20, we could do out[[4]]
because 20 was the 4th value in the vector of SDs we applied over, or, because I named the elements of the output list out
, we can as for the results of the simulation using out[["SD10"]]
.
To Answer some of the specific points about your loops etc.,
names()
, e.g. names(mylist)
<- c("foo","bar"). You'd be better off in your loop calling
names()` once per iteration of the loop to set up the names in a single shot - you probably wouldn't want to fill the names in as you go along as that would be inefficient.foo()
to return a list so the output of lapply()
is a list of lists, we could do:Code:
bar <- function(sd, n, mean, reps = 1) {
rands <- rnorm(n * reps, mean = mean, sd = sd)
if(reps > 1)
rands <- split(rands, rep(seq_len(reps), each = n))
rands
}
sims2 <- lapply(6:10, FUN = bar, mean = 20, n = 500, reps = 4)
names(sims2) <- paste("SD", 6:10, sep = "")
out2 <- lapply(sims2, function(x) sapply(x, function(y) sum(y < 20)))
which gives the same output as before.
Upvotes: 3
Reputation: 11956
The problem is in your indexes: you are running over indexer i from ddss, which runs from 6 to 10. So in the first tour of duty in your outer loop, your first statement really says: liss[[6]]<-list()
, implying that the first 5 ones are NULL.
So if you insist on working with loops, this is what you should do (check ?seq_along
):
ddss<-c(6:10) # sd's to be calculated.
sam<-c(1:4) # 4 samples for each sd
liss<-list() # initializing the list
for (i in seq_along(ddss)) { # now, i runs from 1 to 5
liss[[i]] <- list()
for (j in sam) { # for each sample
liss[[i]][[j]] <- rnorm(500, 20, i)
print(paste('ds',ddss[i],'samp',j,'=',sum(liss[[i]][[j]]<0)))
}
names(liss[[i]])<-as.character(sam)#this should solve your naming issue (1/2)
}
names(liss)<-as.character(ddss)#this should solve your naming issue (2/2)
Note that, as always, it is a good idea to name your variables something more useful than i or j: if you'd named it curds, maybe you wouldn't have used it immediately as an indexer in a list?
Now, if you are really aiming for improvement (but want to stick to lists), you indeed want to go with the apply style functions:
liss<-lapply(ddss, function(curds){ #apply the inline function to each ds and store results in a list
return(lapply(sam, function(cursam){ #apply inline function to each sam and store results in a list
rv<-rnorm(500, 20, curds)
cat('ds',curds,'samp',cursam,'=',sum(rv<0), "\n") #maybe better for your purposes.
return(rv)
}))
})
Finally, for your case, there is not a lot of reason to actually use lists (nor do you even need to keep the sampled data for each ds/sam): you can store everything as a threedimensional array, but since you specify it as a learning exercise (hey, maybe the array thing can be your next exercise :-)), I'll leave it at that.
Upvotes: 4