Reputation: 99
I'm new to R and struggling to group multiple levels of a factor prior to calculating means. This problem is complicated by the fact that I am doing this over hundreds of files, and the factor levels that need to be grouped vary from file to file. I see from previous posts how to address this grouping issue for single levels using levels(), but my data is too variable for this method.
Basically, I'd like to calculate both individual and then an overall mean for multiple levels of a factor. For example, I would like to calculate the mean for each species for each of the following factors present in the column Status: Crypt1, Crypt2, Crypt3, Native, Intro, and then also the overall mean for Crypt species (includes Crypt1, Crypt2, and Crypt3, but not Native or Intro). However, a species either has multiple levels of Crypt (variable, and up to Crypt8), or has Native and Intro, and means for all species at each of these levels are ultimately averaged into the same summary sheet.
For example:
Species Status Value
A Crypt1 5
A Crypt1 6
A Crypt2 4
A Crypt2 8
A Crypt3 10
A Crypt3 50
B Native 2
B Native 9
B Intro 9
B Intro 10
I was thinking that I could use the first letter of each factor to group the Crypt factors together, but I am struggling to target the first letter because they are factors, not strings, and I am not sure how to convert between them. I'm ultimately calculating the means using aggregate(), and I can get individual means for each factor, but not for the grouped factors. Any ideas would be much appreciated, thanks!
Upvotes: 3
Views: 1977
Reputation: 193497
Here's an alternative. Conceptually, it is the same as Arun's answer, but it sticks to functions in base R, and in a way, keeps your workspace and original data somewhat tidy.
I'm assuming we're starting with a data.frame named "temp" and that we want to create two new data.frames, "T1" and "T2", for individual and grouped means.
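For a reproducible starting point, the example rows from the question can be rebuilt as a data.frame. This is only a sketch: the column types are an assumption (the question says the grouping columns are factors, so `stringsAsFactors = TRUE` is used to mirror that).

```r
# Example data copied from the question.
# stringsAsFactors = TRUE is an assumption that mirrors the asker's factor columns.
temp <- data.frame(
  Species = rep(c("A", "B"), times = c(6, 4)),
  Status  = c("Crypt1", "Crypt1", "Crypt2", "Crypt2", "Crypt3", "Crypt3",
              "Native", "Native", "Intro", "Intro"),
  Value   = c(5, 6, 4, 8, 10, 50, 2, 9, 9, 10),
  stringsAsFactors = TRUE
)

# Inspect the structure: Species and Status are factors, Value is numeric.
str(temp)
```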
# Verify that you don't have T1 and T2 in your workspace
ls(pattern = "^T[12]$")
# character(0)
# Use `with` to generate T1 (individual means)
# and to generate T2 (group means)
with(temp, {
  T1 <<- aggregate(Value ~ Species + Status, temp, mean)
  # This assignment changes a local copy of temp, not the original
  temp$Status <- gsub("\\d+$", "", Status)
  T2 <<- aggregate(Value ~ Species + Status, temp, mean)
})
# Now they're there!
ls(pattern = "^T[12]$")
# [1] "T1" "T2"
Notice that we used <<- to assign the results from within with to the global environment. Not everyone likes using that, but I think it is OK in this particular case. Here is what "T1" and "T2" look like.
T1
# Species Status Value
# 1 A Crypt1 5.5
# 2 A Crypt2 6.0
# 3 A Crypt3 30.0
# 4 B Intro 9.5
# 5 B Native 5.5
T2
# Species Status Value
# 1 A Crypt 13.83333
# 2 B Intro 9.50000
# 3 B Native 5.50000
Looking back at the with command, it might have seemed like we had changed the value of the "Status" column. However, that was only within the environment created by using with. Your original data.frame is the same as it was when you started.
temp
# Species Status Value
# 1 A Crypt1 5
# 2 A Crypt1 6
# 3 A Crypt2 4
# 4 A Crypt2 8
# 5 A Crypt3 10
# 6 A Crypt3 50
# 7 B Native 2
# 8 B Native 9
# 9 B Intro 9
# 10 B Intro 10
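If you would rather not use <<- at all, the same two tables can be produced by writing the collapsed labels to a new column of a copy. The following is a minimal sketch under the same assumptions as above; `Group` and `grouped` are hypothetical names, and `as.character()` is the explicit factor-to-string conversion the question asks about.

```r
# Example data copied from the question (factor columns assumed).
temp <- data.frame(
  Species = rep(c("A", "B"), times = c(6, 4)),
  Status  = c("Crypt1", "Crypt1", "Crypt2", "Crypt2", "Crypt3", "Crypt3",
              "Native", "Native", "Intro", "Intro"),
  Value   = c(5, 6, 4, 8, 10, 50, 2, 9, 9, 10),
  stringsAsFactors = TRUE
)

# Individual means per Species and Status.
T1 <- aggregate(Value ~ Species + Status, data = temp, FUN = mean)

# Strip trailing digits so Crypt1, Crypt2, ... all become "Crypt".
# transform() returns a modified copy, so temp itself is left untouched.
grouped <- transform(temp, Group = sub("[0-9]+$", "", as.character(Status)))

# Grouped means per Species and collapsed Status.
T2 <- aggregate(Value ~ Species + Group, data = grouped, FUN = mean)
T2
```

Because the pattern only strips trailing digits, it does not matter how many Crypt levels a given file has.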
Upvotes: 0
Reputation: 118779
For the individual means:
# assuming your data is in a data.frame named df
library(plyr)
df.1 <- ddply(df, .(Species, Status), summarise, ind.m.Value = mean(Value))
df.1
# Species Status ind.m.Value
# 1 A Crypt1 5.5
# 2 A Crypt2 6.0
# 3 A Crypt3 30.0
# 4 B Intro 9.5
# 5 B Native 5.5
For the overall mean, the idea is to remove the numbers present at the end of every entry in Status using sub/gsub.
df.1$Status2 <- gsub("[0-9]+$", "", df.1$Status)
df.2 <- ddply(df.1, .(Species, Status2), summarise, oall.m.Value = mean(ind.m.Value))
df.2
# Species Status2 oall.m.Value
# 1 A Crypt 13.83333
# 2 B Intro 9.50000
# 3 B Native 5.50000
Is this what you're expecting?
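One thing to check before using this on all your files: the overall value above is the mean of the per-level means. In the example every level has two rows, so it matches the mean over all the raw Crypt rows, but with unequal row counts the two can differ. Here is a small sketch in base R on made-up unbalanced data (`unbal` and its values are hypothetical, not from the question):

```r
# Hypothetical unbalanced data: Crypt1 has three rows, Crypt2 has one.
unbal <- data.frame(
  Species = "A",
  Status  = c("Crypt1", "Crypt1", "Crypt1", "Crypt2"),
  Value   = c(1, 2, 3, 10)
)
unbal$Status2 <- sub("[0-9]+$", "", unbal$Status)

# Mean of the per-level means: (2 + 10) / 2
ind <- aggregate(Value ~ Species + Status + Status2, data = unbal, FUN = mean)
mean_of_means <- aggregate(Value ~ Species + Status2, data = ind, FUN = mean)

# Pooled mean over the raw rows: (1 + 2 + 3 + 10) / 4
pooled <- aggregate(Value ~ Species + Status2, data = unbal, FUN = mean)

mean_of_means$Value  # 6
pooled$Value         # 4
```

Which of the two you want depends on whether each Crypt level or each observation should carry equal weight.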
Upvotes: 2