Reputation: 1985

Select the row with the maximum value in each group

In a dataset with multiple observations for each subject. For each subject I want to select the row which have the maximum value of 'pt'. For example, with a following dataset:

ID    <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)

group <- data.frame(Subject=ID, pt=Value, Event=Event)
#   Subject pt Event
# 1       1  2     1
# 2       1  3     1
# 3       1  5     2 # max 'pt' for Subject 1
# 4       2  2     1
# 5       2  5     2
# 6       2  8     1
# 7       2 17     2 # max 'pt' for Subject 2
# 8       3  3     2
# 9       3  5     2 # max 'pt' for Subject 3

Subject 1, 2, and 3 have the biggest pt value of 5, 17, and 5 respectively.

How could I first find the biggest pt value for each subject, and then, put this observation in another data frame? The resulting data frame should only have the biggest pt values for each subject.

Upvotes: 163

Answers (19)

GKi

Reputation: 39727

In base you can use ave to get max per group and compare this with pt and get a logical vector to subset the data.frame.

group[group$pt == ave(group$pt, group$Subject, FUN=max),]
#  Subject pt Event
#3       1  5     2
#7       2 17     2
#9       3  5     2

Or using in addition with.

group[with(group, pt == ave(pt, Subject, FUN=max)),]

Or compare it already in the function.

group[as.logical(ave(group$pt, group$Subject, FUN=function(x) x==max(x))),]
#group[ave(group$pt, group$Subject, FUN=function(x) x==max(x))==1,] #Variant

Upvotes: 4

pat-s

Reputation: 6312

Since {dplyr} v1.0.0 (May 2020) there is the new slice_* syntax which supersedes top_n().

library(tidyverse)

ID    <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)

group <- data.frame(Subject=ID, pt=Value, Event=Event)

group %>% 
  group_by(Subject) %>% 
  slice_max(pt)
#> # A tibble: 3 x 3
#> # Groups:   Subject [3]
#>   Subject    pt Event
#>     <dbl> <dbl> <dbl>
#> 1       1     5     2
#> 2       2    17     2
#> 3       3     5     2

^{Created on 2020-08-18 by the reprex package (v0.3.0.9001)}

Upvotes: 24

akrun

Reputation: 887941

Another option is slice

library(dplyr)
group %>%
     group_by(Subject) %>%
     slice(which.max(pt))
#    Subject    pt Event
#    <dbl> <dbl> <dbl>
#1       1     5     2
#2       2    17     2
#3       3     5     2

Using dplyr 1.1.0

slice_max(group, pt, by = 'Subject')

Upvotes: 32

Maël

Reputation: 52389

With dplyr 1.1.0, you can use slice_max with by to perform inline grouped slicing:

library(dplyr)
group %>% 
  slice_max(pt, n = 1, by = Subject)

#  Subject pt Event
#1       1  5     2
#2       2 17     2
#3       3  5     2

Upvotes: 0

U W

Reputation: 1290

I sometimes fall back on a solution around the order function, as this then works for any type of column (ie. not just numeric). Basically, order the table by whatever you want to max or min, and then take the first or last of each group.

I also find this to be fairly easily readable for data.table beginners.

library(data.table)

setDT(group)[order(pt)][, last(.SD), by = Subject]

Note that replacing last with first achieves the equivalent of the minimum value.

Upvotes: 0

Xi Liang

Reputation: 1301

The most intuitive method is to use group_by and top_n function in dplyr

group %>% group_by(Subject) %>% top_n(1, pt)

The result you get is

Source: local data frame [3 x 3]
Groups: Subject [3]

  Subject    pt Event
    (dbl) (dbl) (dbl)
1       1     5     2
2       2    17     2
3       3     5     2

Upvotes: 118

Zoë Turner

Reputation: 509

Using dplyr 1.0.2 there are now two ways to do this, one is long hand and the other is using the verb across():

      # create data
      ID    <- c(1,1,1,2,2,2,2,3,3)
      Value <- c(2,3,5,2,5,8,17,3,5)
      Event <- c(1,1,2,1,2,1,2,2,2)
      
      group <- data.frame(Subject=ID, pt=Value, Event=Event)

Long hand the verb is max() but note the na.rm = TRUE which is useful for examples where there are NAs as in the closed question: Merge rows in a dataframe where the rows are disjoint and contain NAs:

       group %>% 
        group_by(Subject) %>% 
        summarise(pt = max(pt, na.rm = TRUE),
                  Event = max(Event, na.rm = TRUE))

This is ok if there are only a few columns but if the table has many columns across() is useful. The examples for this verb are often with summarise(across(start_with... but in this example the columns don't start with the same characters. Either they could be changed or the positions listed:

    group %>% 
        group_by(Subject) %>% 
        summarise(across(1:ncol(group)-1, max, na.rm = TRUE, .names = "{.col}"))

Note for the verb across() 1 refers to the first column after the first actual column so using ncol(group) won't work as that is too many columns (makes it position 4 rather than 3).

Upvotes: 0

Karolis Koncevičius

Reputation: 9666

by is a version of tapply for data frames:

res <- by(group, group$Subject, FUN=function(df) df[which.max(df$pt),])

It returns an object of class by so we convert it to data frame:

do.call(rbind, b)
  Subject pt Event
1       1  5     2
2       2 17     2
3       3  5     2

Upvotes: 1

Vykta Wakandigara

Reputation: 349

Another data.table solution:

library(data.table)
setDT(group)[, head(.SD[order(-pt)], 1), by = .(Subject)]

Upvotes: 1

chinsoon12

Reputation: 25223

Another data.table option:

library(data.table)
setDT(group)
group[group[order(-pt), .I[1L], Subject]$V1]

Or another (less readable but slightly faster):

group[group[, rn := .I][order(Subject, -pt), {
    rn[c(1L, 1L + which(diff(Subject)>0L))]
}]]

timing code:

library(data.table)
nr <- 1e7L
ng <- nr/4L
set.seed(0L)
DT <- data.table(Subject=sample(ng, nr, TRUE), pt=1:nr)#rnorm(nr))
DT2 <- copy(DT)


microbenchmark::microbenchmark(times=3L,
    mtd0 = {a0 <- DT[DT[, .I[which.max(pt)], by=Subject]$V1]},
    mtd1 = {a1 <- DT[DT[order(-pt), .I[1L], Subject]$V1]},
    mtd2 = {a2 <- DT2[DT2[, rn := .I][
        order(Subject, -pt), rn[c(TRUE, diff(Subject)>0L)]
    ]]},
    mtd3 = {a3 <- unique(DT[order(Subject, -pt)], by="Subject")}
)
fsetequal(a0[order(Subject)], a1[order(Subject)])
#[1] TRUE
fsetequal(a0[order(Subject)], a2[, rn := NULL][order(Subject)])
#[1] TRUE
fsetequal(a0[order(Subject)], a3[order(Subject)])
#[1] TRUE

timings:

Unit: seconds
 expr      min       lq     mean   median       uq      max neval
 mtd0 3.256322 3.335412 3.371439 3.414502 3.428998 3.443493     3
 mtd1 1.733162 1.748538 1.786033 1.763915 1.812468 1.861022     3
 mtd2 1.136307 1.159606 1.207009 1.182905 1.242359 1.301814     3
 mtd3 1.123064 1.166161 1.228058 1.209257 1.280554 1.351851     3

Upvotes: 1

Ape

Reputation: 1169

Another base solution

group_sorted <- group[order(group$Subject, -group$pt),]
group_sorted[!duplicated(group_sorted$Subject),]

# Subject pt Event
#       1  5     2
#       2 17     2
#       3  5     2

Order the data frame by pt (descending) and then remove rows duplicated in Subject

Upvotes: 7

s_baldur

Reputation: 33733

One more base R solution:

merge(aggregate(pt ~ Subject, max, data = group), group)

  Subject pt Event
1       1  5     2
2       2 17     2
3       3  5     2

Upvotes: 12

Kyoma G

Reputation: 21

Here's another data.table solution, since which.max does not work on characters

library(data.table)
group <- data.table(Subject=ID, pt=Value, Event=Event)

group[, .SD[order(pt, decreasing = TRUE) == 1], by = Subject]

Upvotes: 2

tkmckenzie

Reputation: 1363

A dplyr solution:

library(dplyr)
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)
group <- data.frame(Subject=ID, pt=Value, Event=Event)

group %>%
    group_by(Subject) %>%
    summarize(max.pt = max(pt))

This yields the following data frame:

  Subject max.pt
1       1      5
2       2     17
3       3      5

Upvotes: 19

Mutyalama

Reputation: 51

If you want the biggest pt value for a subject, you could simply use:

   pt_max = as.data.frame(aggregate(pt~Subject, group, max))

Upvotes: -1

Kalees Waran

Reputation: 659

do.call(rbind, lapply(split(group,as.factor(group$Subject)), function(x) {return(x[which.max(x$pt),])}))

Using Base R

Upvotes: 12

Mark Chamness

Reputation: 601

A shorter solution using data.table:

setDT(group)[, .SD[which.max(pt)], by=Subject]
#    Subject pt Event
# 1:       1  5     2
# 2:       2 17     2
# 3:       3  5     2

Upvotes: 60

Arun

Reputation: 118889

Here's a data.table solution:

require(data.table) ## 1.9.2
group <- as.data.table(group)

If you want to keep all the entries corresponding to max values of pt within each group:

group[group[, .I[pt == max(pt)], by=Subject]$V1]
#    Subject pt Event
# 1:       1  5     2
# 2:       2 17     2
# 3:       3  5     2

If you'd like just the first max value of pt:

group[group[, .I[which.max(pt)], by=Subject]$V1]
#    Subject pt Event
# 1:       1  5     2
# 2:       2 17     2
# 3:       3  5     2

In this case, it doesn't make a difference, as there aren't multiple maximum values within any group in your data.

Upvotes: 147

MrFlick

Reputation: 206586

I wasn't sure what you wanted to do about the Event column, but if you want to keep that as well, how about

isIDmax <- with(dd, ave(Value, ID, FUN=function(x) seq_along(x)==which.max(x)))==1
group[isIDmax, ]

#   ID Value Event
# 3  1     5     2
# 7  2    17     2
# 9  3     5     2

Here we use ave to look at the "Value" column for each "ID". Then we determine which value is the maximal and then turn that into a logical vector we can use to subset the original data.frame.

Upvotes: 8

Select the row with the maximum value in each group

Answers (19)

Related Questions