user1165199

Reputation: 6649

Mean of top x entries of subset in R

Say I had the dataframe

df <- data.frame('A' = c('a','a','a','a','b','b','b','b','b'),
                 'B' = c('y','y','z','z','y','y','y','z','z'),
                 'value'=c(1  , 2 , 2 , 3 , 2 , 3 , 1 , 2 , 2))

so it looked like this

 A B value  
 a y     1  
 a y     2  
 a z     2  
 a z     3  
 b y     2  
 b y     3  
 b y     1   
 b z     2   
 b z     2  

I could get the mean of each subset of A and B using the query

with(df, aggregate(df, by = list(A, B), FUN = mean))

which after a bit of manipulation gives

A B value  
a y   1.5  
b y   2.0  
a z   2.5  
b z   2.0  

Is there a way of doing this but only calculating the mean of the highest x values in each subset? If we take x as 2 in this example, the means of the subsets ay, az, and bz would not change, as each has only two entries (so the top x entries are the entire subset). However, by has three entries, so we would want the mean of its two highest values (2 and 3), and the output table would look like

A B value  
a y   1.5  
b y   2.5  
a z   2.5  
b z   2.0  

Upvotes: 1

Views: 265

Answers (3)

johannes

Reputation: 14433

Does this help?

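n <- 2  # number of top values to average (kept distinct from the function argument)
with(df, aggregate(value, by = list(A = A, B = B),
                   FUN = function(v) mean(head(sort(v, decreasing = TRUE), n))))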

Upvotes: 0

Gavin Simpson

Reputation: 174803

Two versions of the same thing:

lapply(split(df, list(df$A, df$B)),
       function(x) mean(x[order(x$value, decreasing = TRUE), ][1:2, "value"]))

or

sapply(split(df, list(df$A, df$B)),
       function(x) mean(x[order(x$value, decreasing = TRUE), ][1:2, "value"]))

give the desired result:

> lapply(split(df, list(df$A, df$B)),
+        function(x) mean(x[order(x$value, decreasing = TRUE), ][1:2, "value"]))
$a.y
[1] 1.5

$b.y
[1] 2.5

$a.z
[1] 2.5

$b.z
[1] 2

> sapply(split(df, list(df$A, df$B)),
+        function(x) mean(x[order(x$value, decreasing = TRUE), ][1:2, "value"]))
a.y b.y a.z b.z 
1.5 2.5 2.5 2.0

In real-world applications you might want to make the anonymous function a proper function and make it robust to cases where there are fewer than 2 rows in a subset (indexing with [1:2] pads a shorter vector with NA, so the mean becomes NA). That is left as an exercise for the reader.
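One possible take on that exercise, as a minimal sketch rather than a definitive implementation: the helper name top_mean and its argument n are made up here for illustration. It relies on head() returning at most n elements, so a subset with fewer than n values is averaged as a whole instead of being padded with NA.

```r
# Hypothetical helper (not part of the original answer):
# mean of the n largest values of a numeric vector.
top_mean <- function(v, n = 2) {
  v <- sort(v, decreasing = TRUE)  # sort() drops NA values by default
  mean(head(v, n))                 # head() returns at most n elements
}

# The data frame from the question.
df <- data.frame(A = c('a','a','a','a','b','b','b','b','b'),
                 B = c('y','y','z','z','y','y','y','z','z'),
                 value = c(1, 2, 2, 3, 2, 3, 1, 2, 2))

# Split the values by the A/B combination and apply the helper to each piece.
res <- sapply(split(df$value, list(df$A, df$B)), top_mean, n = 2)
res
```

A subset with a single value, e.g. top_mean(5), simply returns that value. An empty subset still yields NaN, because mean() of a zero-length vector is NaN, so you may want to handle that case separately.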

The anonymous function (or one very similar) I showed could just as easily be used with aggregate():

aggregate(value ~ A + B, data = df,
          FUN = function(x) mean(x[order(x, decreasing = TRUE)][1:2]))

e.g.:

> aggregate(value ~ A + B, data = df,
+           FUN = function(x) mean(x[order(x, decreasing = TRUE)][1:2]))
  A B value
1 a y   1.5
2 b y   2.5
3 a z   2.5
4 b z   2.0

but I'm old-school and often do these things by hand.

Upvotes: 2

Andrie

Reputation: 179428

I find it easier to use the formula interface to aggregate, as follows:

Your original version:

aggregate(value~A+B, data=df, FUN = mean)
  A B value
1 a y   1.5
2 b y   2.0
3 a z   2.5
4 b z   2.0

You can get your desired version by using an anonymous function that computes the mean of the tail of the sorted values:

aggregate(value~A+B, data=df, FUN = function(x)mean(tail(sort(x), 2)))
  A B value
1 a y   1.5
2 b y   2.5
3 a z   2.5
4 b z   2.0
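Since the question asks for the top x values in general, here is a small sketch of the same idea with the count pulled out into a variable. The name n_top is only an illustration and is not from the question. tail() returns the whole vector when it has fewer than n_top elements, so short subsets are averaged as they are.

```r
# The data frame from the question.
df <- data.frame(A = c('a','a','a','a','b','b','b','b','b'),
                 B = c('y','y','z','z','y','y','y','z','z'),
                 value = c(1, 2, 2, 3, 2, 3, 1, 2, 2))

n_top <- 2  # hypothetical name for the question's "x"

# sort() is ascending, so tail() picks the n_top largest values.
res <- aggregate(value ~ A + B, data = df,
                 FUN = function(v) mean(tail(sort(v), n_top)))
res
```

Changing n_top to 1 gives the maximum of each subset, and any n_top at least as large as the biggest subset reproduces the plain mean.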

Upvotes: 2
