Reputation: 2604

Operate over levels of two factors

I have a dataset that looks something like this, with many classes, each with many (5-10) subclasses, each with a value associated with it:

> data.frame(class=rep(letters[1:4], each=4), subclass=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8), value=1:16)
   class subclass value
1      a        1     1
2      a        1     2
3      a        2     3
4      a        2     4
5      b        3     5
6      b        3     6
7      b        4     7
8      b        4     8
9      c        5     9
10     c        5    10
11     c        6    11
12     c        6    12
13     d        7    13
14     d        7    14
15     d        8    15
16     d        8    16

I want to first sum the values for each class/subclass, then take the median value for each class among all the subclasses.

I.e., the intermediate step would sum the values for each subclass for each class, and would look like this (note that I don't need to keep the data from this intermediate step):

> data.frame(class=rep(letters[1:4], each=2), subclass=1:8, sum=c(3,7,11,15,19,23,27,31))
  class subclass   sum
1     a        1     3
2     a        2     7
3     b        3    11
4     b        4    15
5     c        5    19
6     c        6    23
7     d        7    27
8     d        8    31

The second step would take the median for each class among all the subclasses, and would look like this:

> data.frame(class=letters[1:4], median=c(median(c(3,7)), median(c(11,15)), median(c(19,23)), median(c(27,31))))
  class median
1     a      5
2     b     13
3     c     21
4     d     29

This is the only data I need to keep. Note that both $class and $subclass will be factor variables, and value will always be a non-missing positive integer. Each class will have a varying number of subclasses.

I'm sure I can do this with some nasty for loops, but I was hoping for a better way that's vectorized and easier to maintain.

Upvotes: 1

Answers (3)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193537

Here are two other alternatives.

The first uses ave inside a within statement (mydf here is your data.frame), where we progressively reduce the source data.frame after adding the aggregated columns. Since ave returns one value per original row, the result contains many repeated rows, so we can safely use unique as the last step to get the output you want.

unique(within(mydf, {
  Sum <- ave(value, class, subclass, FUN = sum)
  rm(subclass, value)
  Median <- ave(Sum, class, FUN = median)
  rm(Sum)
}))
#    class Median
# 1      a      5
# 5      b     13
# 9      c     21
# 13     d     29
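
To see why unique is needed, it helps to look at ave on its own: it returns a vector as long as its input, with each group's result repeated for every row of that group. A minimal sketch, assuming mydf holds the example data from the question (the name sums is just for illustration):

```r
# Assumption: mydf is the example data from the question.
mydf <- data.frame(class = rep(letters[1:4], each = 4),
                   subclass = rep(1:8, each = 2),
                   value = 1:16)

# One sum per class/subclass pair, repeated on every row of that pair.
sums <- ave(mydf$value, mydf$class, mydf$subclass, FUN = sum)

length(sums) == nrow(mydf)  # the result lines up row-for-row with mydf
head(sums, 4)               # the first four rows belong to class a
```

Because the result is row-aligned, it can be stored as a column and fed straight into a second ave call, which is what the within block does.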

A second option is to use the "data.table" package and chain ("compound") your statements as below. V1 is the column name that data.table creates automatically when the user does not specify one.

library(data.table)
DT <- data.table(mydf)
DT[, sum(value), by = c("class", "subclass")][, median(V1), by = "class"]
#    class V1
# 1:     a  5
# 2:     b 13
# 3:     c 21
# 4:     d 29
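
If you would rather not rely on the auto-generated V1, you can name the columns yourself in each step. This is only a sketch, assuming mydf is the example data from the question; the names Sum, Median, and res are my own choices for illustration:

```r
library(data.table)

# Assumption: mydf is the example data from the question.
mydf <- data.frame(class = rep(letters[1:4], each = 4),
                   subclass = rep(1:8, each = 2),
                   value = 1:16)
DT <- as.data.table(mydf)

# Step 1: sum per class/subclass, stored in a column named Sum.
# Step 2: median of those sums per class, stored in a column named Median.
res <- DT[, .(Sum = sum(value)), by = .(class, subclass)][
  , .(Median = median(Sum)), by = class]
res
```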

Upvotes: 2

iTech

Reputation: 18440

Here is another example, using aggregate in two steps:

temp <- aggregate(df$value,list(class=df$class,subclass=df$subclass),sum)

aggregate(temp$x,list(class=temp$class),median)

Output:

  class  x
1     a  5
2     b 13
3     c 21
4     d 29

Or if you like a one-liner solution, you can do:

aggregate(value ~ class, median, data=aggregate(value ~ ., sum, data=df))

Upvotes: 3

Gary Weissman

Reputation: 3627

For your first step, you could try:

df_sums <- aggregate(value ~ class + subclass, sum, data=df)

Then:

aggregate(value ~ class, median, data=df_sums)
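
Put together, the two steps can be checked against the expected output in the question. A minimal sketch, assuming df is the example data from the question (df_medians is just an illustrative name):

```r
# Assumption: df is the example data from the question.
df <- data.frame(class = rep(letters[1:4], each = 4),
                 subclass = rep(1:8, each = 2),
                 value = 1:16)

# Step 1: sum the values within each class/subclass pair.
df_sums <- aggregate(value ~ class + subclass, sum, data = df)

# Step 2: median of the subclass sums within each class.
df_medians <- aggregate(value ~ class, median, data = df_sums)
df_medians
```

Note that the result column keeps the name value, so you may want to rename it (for example with names(df_medians)[2] <- "median").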

Upvotes: 2
