Reputation: 3764
I have a loop that I would like to get rid of, but I can't quite see how to. Say I have a data frame:
tmp = data.frame(Gender = rep(c("Male", "Female"), each = 6),
Ethnicity = rep(c("White", "Asian", "Other"), 4),
Score = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12))
I then want to calculate the mean for each level in both the Gender and Ethnicity columns which would give:
$Female
[1] 9.5
$Male
[1] 3.5
$Asian
[1] 6.5
$Other
[1] 7.5
$White
[1] 5.5
This is easy enough to do, but I don't want to use loops - I'm going for speed. So I currently have the following:
for(i in c("Gender", "Ethnicity"))
print(lapply(split(tmp$Score, tmp[, i]), function(x) mean(x)))
Obviously, this uses a loop and is where I am stuck.
There may well be a function which already does this kind of thing that I am unaware of. I have looked at aggregate but I don't think that's what I want.
Upvotes: 3
Views: 206
Reputation: 23758
You should probably reconsider the output you're generating. A single list mixing all of the ethnicity and gender levels together is probably not the best basis for graphing, analyzing, or presenting your data. You might be better off writing two lines of code instead of one, perhaps using tapply
tapply(tmp$Score, tmp$Gender, mean)
tapply(tmp$Score, tmp$Ethnicity, mean)
or aggregate
aggregate(Score ~ Gender, tmp, mean)
aggregate(Score ~ Ethnicity, tmp, mean)
You might also want to look at the interaction between the two variables, even though you suggested aggregate doesn't do what you want.
with(tmp, tapply(Score, list(Gender, Ethnicity), mean))
aggregate(Score ~ Gender + Ethnicity, tmp, mean)
Not only do these give a cleaner separation and presentation of what each variable represents, but the R commands are also more expressive and reflect the intent behind coding those variables separately in the first place.
If your real task involves many variables, any of these can be put into a loop, but I would suggest you still want the output not as one single flat list but as a list of vectors or data.frames.
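As a rough sketch of that last suggestion, the aggregate call can be applied once per grouping column and the results kept as a named list of data.frames. The names group_vars and means_by_var are made up for this illustration, and it assumes Score is the only non-grouping column.

```r
# Example data from the question
tmp <- data.frame(
  Gender    = rep(c("Male", "Female"), each = 6),
  Ethnicity = rep(c("White", "Asian", "Other"), 4),
  Score     = 1:12
)

# Assumption: every column except Score is a grouping variable
group_vars <- setdiff(names(tmp), "Score")

# One data.frame of group means per grouping variable
means_by_var <- lapply(group_vars, function(v) {
  aggregate(tmp["Score"], by = tmp[v], FUN = mean)
})
names(means_by_var) <- group_vars

means_by_var$Gender
means_by_var$Ethnicity
```

Each element keeps its own grouping column, so the gender and ethnicity means stay separate and can be plotted or reported independently.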
Upvotes: 0
Reputation: 120
Try the reshape2 package.
library(reshape2)
# demo: melt to long format (the non-numeric columns are used as id variables)
melted <- melt(tmp)
casted.gender <- dcast(melted, Gender ~ variable, mean)  # mean for each gender
casted.eth <- dcast(melted, Ethnicity ~ variable, mean)  # mean for each ethnicity
# now, combining to do this for all grouping variables at once
variables <- colnames(tmp)[-length(colnames(tmp))]  # every column except the last (Score)
casting <- function(var.name) {
  dcast(melted, melted[, var.name] ~ melted$variable, mean)
}
lapply(variables, FUN = casting)
output:
[[1]]
melted[, var.name] Score
1 Female 9.5
2 Male 3.5
[[2]]
melted[, var.name] Score
1 Asian 6.5
2 Other 7.5
3 White 5.5
Upvotes: 1
Reputation: 887128
Using dplyr
library(dplyr)
library(tidyr)
tmp[,1:2] <- lapply(tmp[,1:2], as.character)
tmp %>%
  gather(Var1, Var2, Gender:Ethnicity) %>%
  unite(Var, Var1, Var2) %>%
  group_by(Var) %>%
  summarise(Score = mean(Score))
# Var Score
#1 Ethnicity_Asian 6.5
#2 Ethnicity_Other 7.5
#3 Ethnicity_White 5.5
#4 Gender_Female 9.5
#5 Gender_Male 3.5
Upvotes: 2
Reputation: 929
You can use the code:
c(tapply(tmp$Score,tmp$Gender,mean),tapply(tmp$Score,tmp$Ethnicity,mean))
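If there are more than two grouping columns, the same idea can be written without repeating the tapply call by hand. This is only a sketch: group_vars and flat_means are names invented here, and it assumes level names do not clash across columns.

```r
# Example data from the question
tmp <- data.frame(
  Gender    = rep(c("Male", "Female"), each = 6),
  Ethnicity = rep(c("White", "Asian", "Other"), 4),
  Score     = 1:12
)

# Hypothetical vector of grouping columns; extend as needed
group_vars <- c("Gender", "Ethnicity")

# One tapply() per grouping column, concatenated with c() as above
flat_means <- do.call(c, lapply(group_vars, function(v) {
  tapply(tmp$Score, tmp[[v]], mean)
}))

flat_means           # named numeric vector of means
as.list(flat_means)  # same values as a flat list with $Female, $Male, ...
```

The as.list() call gives the flat list layout shown in the question.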
Upvotes: 2
Reputation: 9582
You can nest apply functions.
# no print() needed: the result is printed automatically at the console
sapply(c("Gender", "Ethnicity"),
       function(i) lapply(split(tmp$Score, tmp[, i]), mean))
Upvotes: 2
Reputation: 8267
You can sapply() over the names of tmp, except for Score, and then use by() (or aggregate()):
> sapply(setdiff(names(tmp),"Score"),function(xx)by(tmp$Score,tmp[,xx],mean))
$Gender
tmp[, xx]: Female
[1] 9.5
------------------------------------------------------------
tmp[, xx]: Male
[1] 3.5
$Ethnicity
tmp[, xx]: Asian
[1] 6.5
------------------------------------------------------------
tmp[, xx]: Other
[1] 7.5
------------------------------------------------------------
tmp[, xx]: White
[1] 5.5
However, sapply() internally uses a loop, so this won't be much faster than the explicit for loop.
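One way to check that claim on your own machine is to time both variants on a larger data set. The sketch below uses simulated data and two helper functions, loop_means and apply_means, that are made up for this comparison; it uses tapply() rather than by() to keep the output simple, and timings will vary between systems.

```r
set.seed(1)
n <- 1e5

# Simulated data with the same columns as tmp (an assumption for this test)
big <- data.frame(
  Gender    = sample(c("Male", "Female"), n, replace = TRUE),
  Ethnicity = sample(c("White", "Asian", "Other"), n, replace = TRUE),
  Score     = runif(n)
)
group_vars <- setdiff(names(big), "Score")

# Explicit for loop over the grouping columns
loop_means <- function(df, vars) {
  out <- vector("list", length(vars))
  names(out) <- vars
  for (v in vars) out[[v]] <- tapply(df$Score, df[[v]], mean)
  out
}

# Same computation via sapply(), kept as a named list
apply_means <- function(df, vars) {
  sapply(vars, function(v) tapply(df$Score, df[[v]], mean), simplify = FALSE)
}

# Repeat each variant a few times so the timings are measurable
system.time(for (k in 1:20) loop_means(big, group_vars))
system.time(for (k in 1:20) apply_means(big, group_vars))
```

Both functions do the same per-column work, so any difference comes only from the looping construct itself.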
Upvotes: 3