R: Grouping levels of a factor across multiple files

Question

I'm new to R and struggling to group multiple levels of a factor before calculating means. The problem is complicated by the fact that I am doing this across hundreds of files, and the factor levels that need to be grouped vary from file to file. I see from previous posts how to address this grouping issue for single levels using levels(), but my data is too variable for that method.

Basically, I'd like to calculate both individual means and an overall mean for multiple levels of a factor. For example, I would like to calculate the mean for each species at each of the following levels of the column Status: Crypt1, Crypt2, Crypt3, Native, Intro, and then also the overall mean for Crypt species (which includes Crypt1, Crypt2, and Crypt3, but not Native or Intro). However, a species either has multiple levels of Crypt (the number varies, up to Crypt8) or has Native and Intro, and the means for all species at each of these levels are ultimately combined into the same summary sheet.

For example:

Species  Status  Value
A        Crypt1    5 
A        Crypt1    6
A        Crypt2    4
A        Crypt2    8
A        Crypt3    10
A        Crypt3    50
B        Native    2
B        Native    9
B        Intro     9
B        Intro     10

I was thinking that I could use the first letter of each factor to group the Crypt factors together, but I am struggling to target the first letter because they are factors, not strings, and I am not sure how to convert between them. I'm ultimately calculating the means using aggregate(), and I can get individual means for each factor, but not for the grouped factors. Any ideas would be much appreciated, thanks!

Arun · Accepted Answer

For the individual means:

# assuming your data is in a data.frame named df
library(plyr)
# mean of Value for each Species/Status combination
df.1 <- ddply(df, .(Species, Status), summarise, ind.m.Value = mean(Value))

> df.1
#   Species Status ind.m.Value
# 1       A Crypt1         5.5
# 2       A Crypt2         6.0
# 3       A Crypt3        30.0
# 4       B  Intro         9.5
# 5       B Native         5.5
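
The same per-level means can also be obtained with base R's aggregate(), which the question already uses. This is only a minimal sketch, assuming the example table from the question is rebuilt by hand as a data frame named df; the name ind.means is made up for illustration:

```r
# Rebuild the example data from the question (Status stored as a factor)
df <- data.frame(
  Species = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Status  = factor(c("Crypt1", "Crypt1", "Crypt2", "Crypt2", "Crypt3", "Crypt3",
                     "Native", "Native", "Intro", "Intro")),
  Value   = c(5, 6, 4, 8, 10, 50, 2, 9, 9, 10)
)

# Mean of Value for every Species/Status combination present in the data
ind.means <- aggregate(Value ~ Species + Status, data = df, FUN = mean)

# Sort by Species, then Status, to match the layout of df.1
ind.means <- ind.means[order(ind.means$Species, ind.means$Status), ]
```

The formula interface only returns combinations that actually occur, so species B gets no empty Crypt rows.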

For the overall mean, the idea is to remove the numbers present at the end of every entry in Status using sub/gsub.

df.1$Status2 <- gsub("[0-9]+$", "", df.1$Status)
df.2 <- ddply(df.1, .(Species, Status2), summarise, oall.m.Value = mean(ind.m.Value))

> df.2
#   Species Status2 oall.m.Value
# 1       A   Crypt     13.83333
# 2       B   Intro      9.50000
# 3       B  Native      5.50000
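
Two points may be worth spelling out. First, on the factor-versus-string worry in the question: sub() and gsub() accept anything that as.character() can coerce, so a factor column can be passed directly, though an explicit as.character() makes the conversion visible. Second, the value above is a mean of the per-level means. It equals the mean of the raw values here only because every level has the same number of rows. If the raw-value mean is what is wanted, a sketch along these lines might work, assuming the example data is rebuilt as df (Status2 and overall.raw are names chosen for illustration):

```r
# Rebuild the example data from the question (Status stored as a factor)
df <- data.frame(
  Species = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Status  = factor(c("Crypt1", "Crypt1", "Crypt2", "Crypt2", "Crypt3", "Crypt3",
                     "Native", "Native", "Intro", "Intro")),
  Value   = c(5, 6, 4, 8, 10, 50, 2, 9, 9, 10)
)

# Convert the factor to character, then strip any trailing digits
# so that Crypt1, Crypt2, ... all collapse to "Crypt"
df$Status2 <- sub("[0-9]+$", "", as.character(df$Status))

# Mean of the raw values within each Species/Status2 group
overall.raw <- aggregate(Value ~ Species + Status2, data = df, FUN = mean)
```

Stripping the trailing digits rather than matching on the first letter means the level names do not need to be known in advance, which matters when the number of Crypt levels changes from file to file.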

Is this what you're expecting?
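
Since the question mentions hundreds of files, one possible way to apply the same idea to each file is sketched below in base R. Everything about the file layout is an assumption: the folder "path/to/files" is a placeholder, the files are assumed to be CSVs with columns Species, Status, and Value, and summarise_status() is a hypothetical helper written for this example, not part of any package.

```r
# Hypothetical helper: per-level means plus pooled means for one data frame
summarise_status <- function(d) {
  d$Status <- as.character(d$Status)          # factor -> character
  d$Group  <- sub("[0-9]+$", "", d$Status)    # Crypt1, Crypt2, ... -> Crypt

  # Means for each individual level
  ind <- aggregate(Value ~ Species + Status, data = d, FUN = mean)

  # Means of the raw values for each pooled group
  pooled <- aggregate(Value ~ Species + Group, data = d, FUN = mean)
  names(pooled)[2] <- "Status"

  # Drop pooled rows that merely repeat an unnumbered level (Native, Intro)
  pooled <- pooled[!pooled$Status %in% ind$Status, ]

  rbind(ind, pooled)
}

# Assumption: placeholder folder and file pattern; adjust to the real layout
files <- list.files("path/to/files", pattern = "\\.csv$", full.names = TRUE)

# Summarise every file, then stack the results into one table
results     <- lapply(files, function(f) summarise_status(read.csv(f)))
summary.all <- do.call(rbind, results)
```

Because the grouping is derived from the level names in each file, the helper does not care whether a given file goes up to Crypt3 or Crypt8.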
