Reputation: 6649
Say I had the dataframe
df <- data.frame('A' = c('a','a','a','a','b','b','b','b','b'),
'B' = c('y','y','z','z','y','y','y','z','z'),
'value'=c(1 , 2 , 2 , 3 , 2 , 3 , 1 , 2 , 2))
so it looked like this
A B value
a y 1
a y 2
a z 2
a z 3
b y 2
b y 3
b y 1
b z 2
b z 2
I could get the mean of each subset of A and B using the query
with(df, aggregate(df, by = list(A, B), FUN = mean))
which after a bit of manipulation gives
A B value
a y 1.5
b y 2.0
a z 2.5
b z 2.0
Is there a way of doing this but calculating the mean of only the highest x values in each subset? If we take x as 2 in this example, the means of the subsets (a, y), (a, z), and (b, z) would not change, as each has only two entries (so the top x entries are the whole subset). However, (b, y) has three entries, so we would want the mean of its highest two values (2 and 3), and the output table would look like
A B value
a y 1.5
b y 2.5
a z 2.5
b z 2.0
Upvotes: 1
Views: 265
Reputation: 14433
Does this help?
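n <- 2
# sort each group's values in decreasing order, then average the first n
with(df, aggregate(value, by = list(A = A, B = B), FUN = function(v)
mean(head(sort(v, decreasing = TRUE), n))))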
Upvotes: 0
Reputation: 174803
Two versions of the same thing:
lapply(split(df, list(df$A, df$B)),
function(x) mean(x[order(x$value, decreasing = TRUE), ][1:2, "value"]))
or
sapply(split(df, list(df$A, df$B)),
function(x) mean(x[order(x$value, decreasing = TRUE), ][1:2, "value"]))
give the desired result:
> lapply(split(df, list(df$A, df$B)),
+ function(x) mean(x[order(x$value, decreasing = TRUE), ][1:2, "value"]))
$a.y
[1] 1.5
$b.y
[1] 2.5
$a.z
[1] 2.5
$b.z
[1] 2
> sapply(split(df, list(df$A, df$B)),
+ function(x) mean(x[order(x$value, decreasing = TRUE), ][1:2, "value"]))
a.y b.y a.z b.z
1.5 2.5 2.5 2.0
In real-world applications you might want to make the anonymous function a proper function and make it robust to cases where there are fewer than 2 rows in a subset. That is left as an exercise for the reader.
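For what it's worth, one possible shape for such a function is sketched below. The name top_mean and its n argument are made up for illustration; they are not from the question or from base R. It relies on head() returning at most n elements, so a subset with fewer than n rows is simply averaged whole instead of picking up NAs from [1:2] indexing.

```r
# Hypothetical helper (name and arguments are illustrative only):
# mean of the n largest values of a numeric vector.
top_mean <- function(v, n = 2) {
  v <- sort(v, decreasing = TRUE)  # sort() drops NA values by default
  mean(head(v, n))                 # head() returns at most n values
}

df <- data.frame(A = c('a','a','a','a','b','b','b','b','b'),
                 B = c('y','y','z','z','y','y','y','z','z'),
                 value = c(1, 2, 2, 3, 2, 3, 1, 2, 2))

# Split the value column by the A/B combination and apply the helper.
res <- sapply(split(df$value, list(df$A, df$B)), top_mean, n = 2)
print(res)
```

Note that an A/B combination with no rows at all would still come back as NaN, since the mean of an empty vector is NaN; passing drop = TRUE to split() is one way to skip such combinations.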
The anonymous function (or one very similar) I showed could just as easily be used with aggregate():
aggregate(value ~ A + B, data = df,
FUN = function(x) mean(x[order(x, decreasing = TRUE)][1:2]))
e.g.:
> aggregate(value ~ A + B, data = df,
+ FUN = function(x) mean(x[order(x, decreasing = TRUE)][1:2]))
A B value
1 a y 1.5
2 b y 2.5
3 a z 2.5
4 b z 2.0
but I'm old-school and often do these things by hand.
Upvotes: 2
Reputation: 179428
I find it easier to use the formula interface to aggregate(), as follows:
Your original version:
aggregate(value~A+B, data=df, FUN = mean)
A B value
1 a y 1.5
2 b y 2.0
3 a z 2.5
4 b z 2.0
You can get your desired version by using an anonymous function that computes the mean of the tail of the sorted values:
aggregate(value~A+B, data=df, FUN = function(x)mean(tail(sort(x), 2)))
A B value
1 a y 1.5
2 b y 2.5
3 a z 2.5
4 b z 2.0
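A side benefit of tail(sort(x), 2) is that it should cope with subsets shorter than the cut-off, because tail() returns whatever is there when fewer than two values exist. The sketch below illustrates this under two assumptions that are not in the question: the cut-off is kept in a variable n, and an extra single-row group (c, y) with value 5 is appended to the data purely as a test case.

```r
df <- data.frame(A = c('a','a','a','a','b','b','b','b','b'),
                 B = c('y','y','z','z','y','y','y','z','z'),
                 value = c(1, 2, 2, 3, 2, 3, 1, 2, 2))

# Hypothetical extra row: a group ("c", "y") with a single observation.
df2 <- rbind(df, data.frame(A = "c", B = "y", value = 5))

n <- 2  # number of top values to average per group (illustrative variable)

# tail(sort(x), n) keeps the n largest values, or all of them if fewer exist.
res <- aggregate(value ~ A + B, data = df2,
                 FUN = function(x) mean(tail(sort(x), n)))
print(res)
```

The single-row group just returns its own value, and the four original groups give the same means as before.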
Upvotes: 2