Reputation: 6094
This is in response to a question asked on the r-help mailing list.
Here are lots of examples of how to find top values by group using sql
, so I imagine it's easy to convert that knowledge over using the R sqldf
package.
An example: when mtcars
is grouped by cyl
, here are the top three records for each distinct value of cyl
. Note that ties are excluded in this case, but it'd be nice to show some different ways to treat ties.
mpg cyl disp hp drat wt qsec vs am gear carb ranks
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 2.0
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 1.0
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 2.0
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 3.0
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 1.0
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 1.5
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 1.5
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 3.0
How to find the top or bottom (maximum or minimum) N records per group?
Upvotes: 70
Views: 166656
Reputation: 99
data.table
way for picking the lowest 3 mpg per group:
data("mtcars")
setDT(mtcars)[order(mpg), head(.SD, 3), by = "cyl"]
Upvotes: 1
Reputation: 39858
Since dplyr 1.0.0
, the slice_max()
/slice_min()
functions were implemented:
mtcars %>%
group_by(cyl) %>%
slice_max(mpg, n = 2, with_ties = FALSE)
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
2 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
5 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
6 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
The documentation on with_ties
parameter:
Should ties be kept together? The default, TRUE, may return more rows than you request. Use FALSE to ignore ties, and return the first n rows.
Upvotes: 3
Reputation: 1474
There are at least 4 ways to do this thing, however,each has some difference. We using u_id to group and using lift value to order/sort
1 dplyr traditional way
library(dplyr)
top10_final_subset1 = final_subset %>% arrange(desc(lift)) %>% group_by(u_id) %>% slice(1:10)
and if you switch the order of arrange(desc(lift)) and group_by(u_id) the result is essential the same.And if there is tie for equal lift value,it will slice to make sure each group has no more than 10 values, if you only have 5 lift value in the group, it will only gives you 5 results for that group.
2 dplyr topN way
library(dplyr)
top10_final_subset2 = final_subset %>% group_by(u_id) %>% top_n(10,lift)
this one if you have tie in lift value, say 15 same lift for the same u_id, you will got all 15 observations
3 data.table tail way
library(data.table)
final_subset = data.table(final_subset,key = "lift")
top10_final_subset3 = final_subset[,tail(.SD,10),,by = c("u_id")]
It has the same row numbers as the first way, however, there are some rows are different, I guess they are using diff random algorithm dealing with tie.
4 data.table .SD way
library(data.table)
top10_final_subset4 = final_subset[,.SD[order(lift,decreasing = TRUE),][1:10],by = "u_id"]
This way is the most "uniform" way,if in a group there are only 5 observation it will repeat value to make it to 10 observations and if there are ties it will still slice and only hold for 10 observations.
Upvotes: 10
Reputation: 696
dplyr
does the trick
mtcars %>%
arrange(desc(mpg)) %>%
group_by(cyl) %>% slice(1:2)
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
2 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
5 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
6 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Upvotes: 35
Reputation: 1765
I prefer @Ista solution, cause needs no extra package and is simple.
A modification of the data.table
solution also solve my problem, and is more general.
My data.frame is
> str(df)
'data.frame': 579 obs. of 11 variables:
$ trees : num 2000 5000 1000 2000 1000 1000 2000 5000 5000 1000 ...
$ interDepth: num 2 3 5 2 3 4 4 2 3 5 ...
$ minObs : num 6 4 1 4 10 6 10 10 6 6 ...
$ shrinkage : num 0.01 0.001 0.01 0.005 0.01 0.01 0.001 0.005 0.005 0.001 ...
$ G1 : num 0 2 2 2 2 2 8 8 8 8 ...
$ G2 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ qx : num 0.44 0.43 0.419 0.439 0.43 ...
$ efet : num 43.1 40.6 39.9 39.2 38.6 ...
$ prec : num 0.606 0.593 0.587 0.582 0.574 0.578 0.576 0.579 0.588 0.585 ...
$ sens : num 0.575 0.57 0.573 0.575 0.587 0.574 0.576 0.566 0.542 0.545 ...
$ acu : num 0.631 0.645 0.647 0.648 0.655 0.647 0.619 0.611 0.591 0.594 ...
The data.table
solution needs order
on i
to do the job:
> require(data.table)
> dt1 <- data.table(df)
> dt2 = dt1[order(-efet, G1, G2), head(.SD, 3), by = .(G1, G2)]
> dt2
G1 G2 trees interDepth minObs shrinkage qx efet prec sens acu
1: 0 FALSE 2000 2 6 0.010 0.4395953 43.066 0.606 0.575 0.631
2: 0 FALSE 2000 5 1 0.005 0.4294718 37.554 0.583 0.548 0.607
3: 0 FALSE 5000 2 6 0.005 0.4395753 36.981 0.575 0.559 0.616
4: 2 FALSE 5000 3 4 0.001 0.4296346 40.624 0.593 0.570 0.645
5: 2 FALSE 1000 5 1 0.010 0.4186802 39.915 0.587 0.573 0.647
6: 2 FALSE 2000 2 4 0.005 0.4390503 39.164 0.582 0.575 0.648
7: 8 FALSE 2000 4 10 0.001 0.4511349 38.240 0.576 0.576 0.619
8: 8 FALSE 5000 2 10 0.005 0.4469665 38.064 0.579 0.566 0.611
9: 8 FALSE 5000 3 6 0.005 0.4426952 37.888 0.588 0.542 0.591
10: 2 TRUE 5000 3 4 0.001 0.3812878 21.057 0.510 0.479 0.615
11: 2 TRUE 2000 3 10 0.005 0.3790536 20.127 0.507 0.470 0.608
12: 2 TRUE 1000 5 4 0.001 0.3690911 18.981 0.500 0.475 0.611
13: 8 TRUE 5000 6 10 0.010 0.2865042 16.870 0.497 0.435 0.635
14: 0 TRUE 2000 6 4 0.010 0.3192862 9.779 0.460 0.433 0.621
By some reason, it does not order the way pointed (probably because ordering by the groups). So, another ordering is done.
> dt2[order(G1, G2)]
G1 G2 trees interDepth minObs shrinkage qx efet prec sens acu
1: 0 FALSE 2000 2 6 0.010 0.4395953 43.066 0.606 0.575 0.631
2: 0 FALSE 2000 5 1 0.005 0.4294718 37.554 0.583 0.548 0.607
3: 0 FALSE 5000 2 6 0.005 0.4395753 36.981 0.575 0.559 0.616
4: 0 TRUE 2000 6 4 0.010 0.3192862 9.779 0.460 0.433 0.621
5: 2 FALSE 5000 3 4 0.001 0.4296346 40.624 0.593 0.570 0.645
6: 2 FALSE 1000 5 1 0.010 0.4186802 39.915 0.587 0.573 0.647
7: 2 FALSE 2000 2 4 0.005 0.4390503 39.164 0.582 0.575 0.648
8: 2 TRUE 5000 3 4 0.001 0.3812878 21.057 0.510 0.479 0.615
9: 2 TRUE 2000 3 10 0.005 0.3790536 20.127 0.507 0.470 0.608
10: 2 TRUE 1000 5 4 0.001 0.3690911 18.981 0.500 0.475 0.611
11: 8 FALSE 2000 4 10 0.001 0.4511349 38.240 0.576 0.576 0.619
12: 8 FALSE 5000 2 10 0.005 0.4469665 38.064 0.579 0.566 0.611
13: 8 FALSE 5000 3 6 0.005 0.4426952 37.888 0.588 0.542 0.591
14: 8 TRUE 5000 6 10 0.010 0.2865042 16.870 0.497 0.435 0.635
Upvotes: 2
Reputation: 321
You can write a function that splits the database by a factor, orders by another desired variable, extract the number of rows you want in each factor (category) and combine these into a database.
top<-function(x, num, c1,c2){
sorted<-x[with(x,order(x[,c1],x[,c2],decreasing=T)),]
splits<-split(sorted,sorted[,c1])
df<-lapply(splits,head,num)
do.call(rbind.data.frame,df)}
x is the dataframe;
num is the number of number of rows you would like to see;
c1 is the column number of the variable you would like to split by;
c2 is the column number of the variable you would like to rank by or handle ties.
Using the mtcars data, the function extracts the 3 heaviest cars (mtcars$wt is the 6th column) in each cylinder class (mtcars$cyl is the 2nd column)
top(mtcars,3,2,6)
mpg cyl disp hp drat wt qsec vs am gear carb
4.Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
4.Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
4.Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
6.Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6.Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
6.Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
8.Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
8.Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
8.Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
You can also easily get the lightest in a class by changing head in the lapply function to tail OR by removing the decreasing=T argument in the order function which will return it to its default, decreasing=F.
Upvotes: 3
Reputation: 118779
This seems more straightforward using data.table
as it performs the sort while setting the key.
So, if I were to get the top 3 records in sort (ascending order), then,
require(data.table)
d <- data.table(mtcars, key="cyl")
d[, head(.SD, 3), by=cyl]
does it.
And if you want the descending order
d[, tail(.SD, 3), by=cyl] # Thanks @MatthewDowle
Edit: To sort out ties using mpg
column:
d <- data.table(mtcars, key="cyl")
d.out <- d[, .SD[mpg %in% head(sort(unique(mpg)), 3)], by=cyl]
# cyl mpg disp hp drat wt qsec vs am gear carb rank
# 1: 4 22.8 108.0 93 3.85 2.320 18.61 1 1 4 1 11
# 2: 4 22.8 140.8 95 3.92 3.150 22.90 1 0 4 2 1
# 3: 4 21.5 120.1 97 3.70 2.465 20.01 1 0 3 1 8
# 4: 4 21.4 121.0 109 4.11 2.780 18.60 1 1 4 2 6
# 5: 6 18.1 225.0 105 2.76 3.460 20.22 1 0 3 1 7
# 6: 6 19.2 167.6 123 3.92 3.440 18.30 1 0 4 4 1
# 7: 6 17.8 167.6 123 3.92 3.440 18.90 1 0 4 4 2
# 8: 8 14.3 360.0 245 3.21 3.570 15.84 0 0 3 4 7
# 9: 8 10.4 472.0 205 2.93 5.250 17.98 0 0 3 4 14
# 10: 8 10.4 460.0 215 3.00 5.424 17.82 0 0 3 4 5
# 11: 8 13.3 350.0 245 3.73 3.840 15.41 0 0 3 4 3
# and for last N elements, of course it is straightforward
d.out <- d[, .SD[mpg %in% tail(sort(unique(mpg)), 3)], by=cyl]
Upvotes: 61
Reputation: 263301
If there were a tie at the fourth position for mtcars$mpg then this should return all the ties:
top_mpg <- mtcars[ mtcars$mpg >= mtcars$mpg[order(mtcars$mpg, decreasing=TRUE)][4] , ]
> top_mpg
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Since there is a tie at the 3-4 position you can test it by changing 4 to a 3, and it still returns 4 items. This is logical indexing and you might need to add a clause that removes the NA's or wrap which() around the logical expression. It's not much more difficult to do this "by" cyl:
Reduce(rbind, by(mtcars, mtcars$cyl,
function(d) d[ d$mpg >= d$mpg[order(d$mpg, decreasing=TRUE)][4] , ]) )
#-------------
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Incorporating my suggestion to @Ista:
Reduce(rbind, by(mtcars, mtcars$cyl, function(d) d[ d$mpg <= sort( d$mpg )[3] , ]) )
Upvotes: 4
Reputation: 10437
Just sort by whatever (mpg for example, question is not clear on this)
mt <- mtcars[order(mtcars$mpg), ]
then use the by function to get the top n rows in each group
d <- by(mt, mt["cyl"], head, n=4)
If you want the result to be a data.frame:
Reduce(rbind, d)
Edit: Handling ties is more difficult, but if all ties are desired:
by(mt, mt["cyl"], function(x) x[rank(x$mpg) %in% sort(unique(rank(x$mpg)))[1:4], ])
Another approach is to break ties based on some other information, e.g.,
mt <- mtcars[order(mtcars$mpg, mtcars$hp), ]
by(mt, mt["cyl"], head, n=4)
Upvotes: 25
Reputation: 6094
# start with the mtcars data frame (included with your installation of R)
mtcars
# pick your 'group by' variable
gbv <- 'cyl'
# IMPORTANT NOTE: you can only include one group by variable here
# ..if you need more, the `order` function below will need
# one per inputted parameter: order( x$cyl , x$am )
# choose whether you want to find the minimum or maximum
find.maximum <- FALSE
# create a simple data frame with only two columns
x <- mtcars
# order it based on
x <- x[ order( x[ , gbv ] , decreasing = find.maximum ) , ]
# figure out the ranks of each miles-per-gallon, within cyl columns
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank ) )
}
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]
# look at your results
result
# done!
# but note only *two* values where cyl == 4 were kept,
# because there was a tie for third smallest, and the `rank` function gave both '3.5'
x[ x$ranks == 3.5 , ]
# ..if you instead wanted to keep all ties, you could change the
# tie-breaking behavior of the `rank` function.
# using the `min` *includes* all ties. using `max` would *exclude* all ties
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) )
}
# and there are even more options..
# see ?rank for more methods
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]
# look at your results
result
# and notice *both* cyl == 4 and ranks == 3 were included in your results
# because of the tie-breaking behavior chosen.
Upvotes: 1