Reputation: 1985
In a dataset with multiple observations for each subject. For each subject I want to select the row which have the maximum value of 'pt'. For example, with a following dataset:
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)
group <- data.frame(Subject=ID, pt=Value, Event=Event)
# Subject pt Event
# 1 1 2 1
# 2 1 3 1
# 3 1 5 2 # max 'pt' for Subject 1
# 4 2 2 1
# 5 2 5 2
# 6 2 8 1
# 7 2 17 2 # max 'pt' for Subject 2
# 8 3 3 2
# 9 3 5 2 # max 'pt' for Subject 3
Subject 1, 2, and 3 have the biggest pt value of 5, 17, and 5 respectively.
How could I first find the biggest pt value for each subject, and then, put this observation in another data frame? The resulting data frame should only have the biggest pt values for each subject.
Upvotes: 163
Views: 204123
Reputation: 39717
In base you can use ave
to get max
per group and compare this with pt
and get a logical vector to subset the data.frame
.
group[group$pt == ave(group$pt, group$Subject, FUN=max),]
# Subject pt Event
#3 1 5 2
#7 2 17 2
#9 3 5 2
Or using in addition with
.
group[with(group, pt == ave(pt, Subject, FUN=max)),]
Or compare it already in the function.
group[as.logical(ave(group$pt, group$Subject, FUN=function(x) x==max(x))),]
#group[ave(group$pt, group$Subject, FUN=function(x) x==max(x))==1,] #Variant
Upvotes: 4
Reputation: 6312
Since {dplyr} v1.0.0 (May 2020) there is the new slice_*
syntax which supersedes top_n()
.
See also https://dplyr.tidyverse.org/reference/slice.html.
library(tidyverse)
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)
group <- data.frame(Subject=ID, pt=Value, Event=Event)
group %>%
group_by(Subject) %>%
slice_max(pt)
#> # A tibble: 3 x 3
#> # Groups: Subject [3]
#> Subject pt Event
#> <dbl> <dbl> <dbl>
#> 1 1 5 2
#> 2 2 17 2
#> 3 3 5 2
Created on 2020-08-18 by the reprex package (v0.3.0.9001)
Upvotes: 24
Reputation: 887741
Another option is slice
library(dplyr)
group %>%
group_by(Subject) %>%
slice(which.max(pt))
# Subject pt Event
# <dbl> <dbl> <dbl>
#1 1 5 2
#2 2 17 2
#3 3 5 2
Using dplyr
1.1.0
slice_max(group, pt, by = 'Subject')
Upvotes: 32
Reputation: 52319
With dplyr 1.1.0
, you can use slice_max
with by
to perform inline grouped slicing:
library(dplyr)
group %>%
slice_max(pt, n = 1, by = Subject)
# Subject pt Event
#1 1 5 2
#2 2 17 2
#3 3 5 2
Upvotes: 0
Reputation: 1290
I sometimes fall back on a solution around the order
function, as this then works for any type of column (ie. not just numeric). Basically, order the table by whatever you want to max or min, and then take the first or last of each group.
I also find this to be fairly easily readable for data.table
beginners.
library(data.table)
setDT(group)[order(pt)][, last(.SD), by = Subject]
Note that replacing last
with first
achieves the equivalent of the minimum value.
Upvotes: 0
Reputation: 1301
The most intuitive method is to use group_by
and top_n
function in dplyr
group %>% group_by(Subject) %>% top_n(1, pt)
The result you get is
Source: local data frame [3 x 3]
Groups: Subject [3]
Subject pt Event
(dbl) (dbl) (dbl)
1 1 5 2
2 2 17 2
3 3 5 2
Upvotes: 118
Reputation: 499
Using dplyr 1.0.2 there are now two ways to do this, one is long hand and the other is using the verb across():
# create data
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)
group <- data.frame(Subject=ID, pt=Value, Event=Event)
Long hand the verb is max() but note the na.rm = TRUE which is useful for examples where there are NAs as in the closed question: Merge rows in a dataframe where the rows are disjoint and contain NAs:
group %>%
group_by(Subject) %>%
summarise(pt = max(pt, na.rm = TRUE),
Event = max(Event, na.rm = TRUE))
This is ok if there are only a few columns but if the table has many columns across() is useful. The examples for this verb are often with summarise(across(start_with... but in this example the columns don't start with the same characters. Either they could be changed or the positions listed:
group %>%
group_by(Subject) %>%
summarise(across(1:ncol(group)-1, max, na.rm = TRUE, .names = "{.col}"))
Note for the verb across() 1 refers to the first column after the first actual column so using ncol(group) won't work as that is too many columns (makes it position 4 rather than 3).
Upvotes: 0
Reputation: 9656
by
is a version of tapply
for data frames:
res <- by(group, group$Subject, FUN=function(df) df[which.max(df$pt),])
It returns an object of class by
so we convert it to data frame:
do.call(rbind, b)
Subject pt Event
1 1 5 2
2 2 17 2
3 3 5 2
Upvotes: 1
Reputation: 349
Another data.table
solution:
library(data.table)
setDT(group)[, head(.SD[order(-pt)], 1), by = .(Subject)]
Upvotes: 1
Reputation: 25225
Another data.table
option:
library(data.table)
setDT(group)
group[group[order(-pt), .I[1L], Subject]$V1]
Or another (less readable but slightly faster):
group[group[, rn := .I][order(Subject, -pt), {
rn[c(1L, 1L + which(diff(Subject)>0L))]
}]]
timing code:
library(data.table)
nr <- 1e7L
ng <- nr/4L
set.seed(0L)
DT <- data.table(Subject=sample(ng, nr, TRUE), pt=1:nr)#rnorm(nr))
DT2 <- copy(DT)
microbenchmark::microbenchmark(times=3L,
mtd0 = {a0 <- DT[DT[, .I[which.max(pt)], by=Subject]$V1]},
mtd1 = {a1 <- DT[DT[order(-pt), .I[1L], Subject]$V1]},
mtd2 = {a2 <- DT2[DT2[, rn := .I][
order(Subject, -pt), rn[c(TRUE, diff(Subject)>0L)]
]]},
mtd3 = {a3 <- unique(DT[order(Subject, -pt)], by="Subject")}
)
fsetequal(a0[order(Subject)], a1[order(Subject)])
#[1] TRUE
fsetequal(a0[order(Subject)], a2[, rn := NULL][order(Subject)])
#[1] TRUE
fsetequal(a0[order(Subject)], a3[order(Subject)])
#[1] TRUE
timings:
Unit: seconds
expr min lq mean median uq max neval
mtd0 3.256322 3.335412 3.371439 3.414502 3.428998 3.443493 3
mtd1 1.733162 1.748538 1.786033 1.763915 1.812468 1.861022 3
mtd2 1.136307 1.159606 1.207009 1.182905 1.242359 1.301814 3
mtd3 1.123064 1.166161 1.228058 1.209257 1.280554 1.351851 3
Upvotes: 1
Reputation: 1169
Another base solution
group_sorted <- group[order(group$Subject, -group$pt),]
group_sorted[!duplicated(group_sorted$Subject),]
# Subject pt Event
# 1 5 2
# 2 17 2
# 3 5 2
Order the data frame by pt
(descending) and then remove rows duplicated in Subject
Upvotes: 7
Reputation: 33603
One more base R solution:
merge(aggregate(pt ~ Subject, max, data = group), group)
Subject pt Event
1 1 5 2
2 2 17 2
3 3 5 2
Upvotes: 12
Reputation: 21
Here's another data.table
solution, since which.max
does not work on characters
library(data.table)
group <- data.table(Subject=ID, pt=Value, Event=Event)
group[, .SD[order(pt, decreasing = TRUE) == 1], by = Subject]
Upvotes: 2
Reputation: 1363
A dplyr
solution:
library(dplyr)
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)
group <- data.frame(Subject=ID, pt=Value, Event=Event)
group %>%
group_by(Subject) %>%
summarize(max.pt = max(pt))
This yields the following data frame:
Subject max.pt
1 1 5
2 2 17
3 3 5
Upvotes: 19
Reputation: 51
If you want the biggest pt value for a subject, you could simply use:
pt_max = as.data.frame(aggregate(pt~Subject, group, max))
Upvotes: -1
Reputation: 659
do.call(rbind, lapply(split(group,as.factor(group$Subject)), function(x) {return(x[which.max(x$pt),])}))
Using Base R
Upvotes: 12
Reputation: 601
A shorter solution using data.table
:
setDT(group)[, .SD[which.max(pt)], by=Subject]
# Subject pt Event
# 1: 1 5 2
# 2: 2 17 2
# 3: 3 5 2
Upvotes: 60
Reputation: 118879
Here's a data.table
solution:
require(data.table) ## 1.9.2
group <- as.data.table(group)
If you want to keep all the entries corresponding to max values of pt
within each group:
group[group[, .I[pt == max(pt)], by=Subject]$V1]
# Subject pt Event
# 1: 1 5 2
# 2: 2 17 2
# 3: 3 5 2
If you'd like just the first max value of pt
:
group[group[, .I[which.max(pt)], by=Subject]$V1]
# Subject pt Event
# 1: 1 5 2
# 2: 2 17 2
# 3: 3 5 2
In this case, it doesn't make a difference, as there aren't multiple maximum values within any group in your data.
Upvotes: 147
Reputation: 206516
I wasn't sure what you wanted to do about the Event column, but if you want to keep that as well, how about
isIDmax <- with(dd, ave(Value, ID, FUN=function(x) seq_along(x)==which.max(x)))==1
group[isIDmax, ]
# ID Value Event
# 3 1 5 2
# 7 2 17 2
# 9 3 5 2
Here we use ave
to look at the "Value" column for each "ID". Then we determine which value is the maximal and then turn that into a logical vector we can use to subset the original data.frame.
Upvotes: 8