Reputation: 2604
I have a dataset that looks something like this, with many classes, each with many (5-10) subclasses, each with a value associated with it:
> data.frame(class=rep(letters[1:4], each=4), subclass=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8), value=1:16)
class subclass value
1 a 1 1
2 a 1 2
3 a 2 3
4 a 2 4
5 b 3 5
6 b 3 6
7 b 4 7
8 b 4 8
9 c 5 9
10 c 5 10
11 c 6 11
12 c 6 12
13 d 7 13
14 d 7 14
15 d 8 15
16 d 8 16
I want to first sum the values for each class/subclass, then take the median value for each class among all the subclasses.
I.e., the intermediate step would sum the values for each subclass for each class, and would look like this (note that I don't need to keep the data from this intermediate step):
> data.frame(class=rep(letters[1:4], each=2), subclass=1:8, sum=c(3,7,11,15,19,23,27,31))
class subclass sum
1 a 1 3
2 a 2 7
3 b 3 11
4 b 4 15
5 c 5 19
6 c 6 23
7 d 7 27
8 d 8 31
The second step would take the median for each class among all the subclasses, and would look like this:
> data.frame(class=letters[1:4], median=c(median(c(3,7)), median(c(11,15)), median(c(19,23)), median(c(27,31))))
class median
1 a 5
2 b 13
3 c 21
4 d 29
This is the only data I need to keep. Note that both $class and $subclass will be factor variables, and value will always be a non-missing positive integer. Each class will have a varying number of subclasses.
I'm sure I can do this with some nasty for loops, but I was hoping for a better way that's vectorized and easier to maintain.
Upvotes: 1
Views: 128
Reputation: 193537
Here are two other alternatives.
The first uses ave() inside a within() call, where we progressively reduce the source data.frame after adding in the aggregated data. Since this will result in many repeated rows, we can safely use unique() as the last step to get the output you want.
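unique(within(mydf, {
  # per-row sum of value within each class/subclass pair
  Sum <- ave(value, class, subclass, FUN = sum)
  rm(subclass, value)
  # per-row median of those sums within each class
  Median <- ave(Sum, class, FUN = median)
  rm(Sum)
}))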
# class Median
# 1 a 5
# 5 b 13
# 9 c 21
# 13 d 29
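One caveat with the ave() approach: ave() returns one value per input row, so each subclass sum is repeated once per row of that subclass. The median over those rows equals the median over subclasses only when every subclass in a class has the same number of rows, as in the example data. A minimal sketch of a variant that deduplicates the sums before taking the median is below; mydf is rebuilt from the question's data, and the names sums and result are only illustrative.

```r
# Rebuild the question's example data (mydf is the name used in this answer).
mydf <- data.frame(class = rep(letters[1:4], each = 4),
                   subclass = rep(1:8, each = 2),
                   value = 1:16)

# Step 1: sum per class/subclass, then keep one row per subclass.
sums <- unique(within(mydf, {
  Sum <- ave(value, class, subclass, FUN = sum)
  rm(value)
}))

# Step 2: median of the subclass sums per class, one row per class.
result <- unique(within(sums, {
  Median <- ave(Sum, class, FUN = median)
  rm(subclass, Sum)
}))
print(result)
```

On the question's data this gives the same four medians as above (5, 13, 21, 29); the two versions can differ only when subclasses have unequal row counts.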
A second option is to use the "data.table" package and "compound" (chain) your statements as below. V1 is the name that data.table creates automatically when the user does not specify one.
library(data.table)
DT <- data.table(mydf)
DT[, sum(value), by = c("class", "subclass")][, median(V1), by = "class"]
# class V1
# 1: a 5
# 2: b 13
# 3: c 21
# 4: d 29
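If you would rather not rely on the default V1 name, the aggregates can be named explicitly. This is a sketch assuming the data.table package is installed; mydf is rebuilt from the question's data, and the column names Sum and Median and the object res are my own choices.

```r
library(data.table)

# Rebuild the question's example data.
mydf <- data.frame(class = rep(letters[1:4], each = 4),
                   subclass = rep(1:8, each = 2),
                   value = 1:16)
DT <- as.data.table(mydf)

# First [ ]: sum per class/subclass. Second [ ]: median of those sums per class.
# as.numeric() keeps the result a double for every group, since median() of an
# odd-length integer vector returns an integer.
res <- DT[, .(Sum = sum(value)), by = .(class, subclass)][
  , .(Median = median(as.numeric(Sum))), by = class]
print(res)
```

The chaining is the same as in the one-liner above; only the column names change.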
Upvotes: 2
Reputation: 18440
Here is another example, using aggregate:
temp <- aggregate(df$value, list(class = df$class, subclass = df$subclass), sum)
aggregate(temp$x, list(class = temp$class), median)
Output:
class x
1 a 5
2 b 13
3 c 21
4 d 29
Or, if you prefer a one-liner:
aggregate(value ~ class, median, data=aggregate(value ~ ., sum, data=df))
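Note that the `.` on the right-hand side of the inner formula stands for every other column in df, so the one-liner groups by class and subclass only because those are the only other columns. If your real data has extra columns, naming the grouping columns is probably safer. A self-contained sketch, with df rebuilt from the question's data and the result column renamed to median for readability (res is just an illustrative name):

```r
# Rebuild the question's example data.
df <- data.frame(class = rep(letters[1:4], each = 4),
                 subclass = rep(1:8, each = 2),
                 value = 1:16)

# Inner call: sum of value per class/subclass.
sums <- aggregate(value ~ class + subclass, data = df, FUN = sum)

# Outer call: median of those sums per class.
res <- aggregate(value ~ class, data = sums, FUN = median)
names(res)[2] <- "median"
print(res)
```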
Upvotes: 3
Reputation: 3627
For the first step, you could try:
df_sums <- aggregate(value ~ class + subclass, sum, data=df)
Then:
aggregate(value ~ class, median, data=df_sums)
Upvotes: 2