jmogil
jmogil

Reputation: 153

Extract row corresponding to maximum value by group for multiple variables

I have a data frame grouped by ID with multiple rows per ID and several variables a, b, c, etc.

Here is a toy example:

dt <- structure(list(ID = c(1, 1, 2, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 
5, 5, 6, 6, 6, 6, 6, 6, 7, 8, 8, 8, 8, 9, 9, 9, 10, 10), a = c(1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1), b = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1), c = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), d = c(1, 1, 0, 0, 
0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 
0, 0, 0, 0, 1, 1), e = c(0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1), f = c(1, 
1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 
0, 1, 1, 1, 0, 1, 1, 1, 1), g = c(1, 1, 1, 1, 0, 0, 0, 0, 1, 
1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1), h = c(1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1), i = c(1, 1, 1, 1, 
1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 
0, 0, 0, 0, 1, 1)), row.names = c(NA, -31L), class = c("tbl_df", 
"tbl", "data.frame"))

For each ID, I want to extract the row that represents the maximum value for each variable (taking first or last instance of max value is not a concern). There are plenty of examples of how to do this when only considering one variable for example. However, I am having difficulty applying this across multiple variables.

Here was my attempt to solve the problem (using data.table and lapply):

library(data.table)
setDT(dt)
variables = colnames(dt[, 2:10])
dt_max = dt[, lapply(.SD, which.max), .SDcols = variables, by = "ID"]

Looking at what this produces, it appears that values are being summed rather than the maximum value being extracted for each ID:

    ID a b c d e f g h i
 1:  1 1 1 1 1 2 1 1 1 1
 2:  2 1 1 1 1 1 1 1 1 1
 3:  3 1 1 1 1 5 1 1 2 1
 4:  4 1 1 1 1 1 1 1 1 1
 5:  5 1 1 1 1 1 3 1 1 1
 6:  6 1 1 1 1 1 1 1 1 1
 7:  7 1 1 1 1 1 1 1 1 1
 8:  8 1 1 1 1 1 2 1 2 2
 9:  9 1 1 1 1 1 2 1 1 1
10: 10 1 1 1 1 1 1 1 1 1

This is my desired/expected output:

    ID a b c d e f g h i
 1:  1 1 1 1 1 1 1 1 1 1
 2:  2 1 1 1 0 0 1 1 0 1
 3:  3 1 1 1 0 1 1 1 1 1
 4:  4 1 1 1 0 0 1 1 0 0
 5:  5 1 1 1 1 1 1 1 0 0
 6:  6 1 1 1 1 1 1 1 0 1
 7:  7 1 1 1 1 1 0 1 0 0
 8:  8 1 1 1 1 0 1 1 1 1
 9:  9 1 1 1 0 1 1 1 0 0
10: 10 1 1 1 1 1 1 1 1 1

I am at a loss for why this would be happening. My only other thought would be to do this the long way for each variable separately, then merging the results together. But this seems like a very inefficient way to solve the problem.

Any help would be greatly appreciated!

Upvotes: 0

Views: 247

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 388817

max and which.max are two different functions doing different things. max would give the max value in a vector whereas which.max would give position of the max value in the vector.

x <- 4:1

max(x)
#[1] 4
which.max(x)
#[1] 1

Here which.max returns 1 because 4 is present at the 1st position in the vector x.

So if you need max values in multiple columns, you should use max and not which.max.

library(data.table)
setDT(dt)
variables = colnames(dt[, 2:10])

dt[, lapply(.SD, max), .SDcols = variables, ID]

#    ID a b c d e f g h i
# 1:  1 1 1 1 1 1 1 1 1 1
# 2:  2 1 1 1 0 0 1 1 0 1
# 3:  3 1 1 1 0 1 1 1 1 1
# 4:  4 1 1 1 0 0 1 1 0 0
# 5:  5 1 1 1 1 1 1 1 0 0
# 6:  6 1 1 1 1 1 1 1 0 1
# 7:  7 1 1 1 1 1 0 1 0 0
# 8:  8 1 1 1 1 0 1 1 1 1
# 9:  9 1 1 1 0 1 1 1 0 0
#10: 10 1 1 1 1 1 1 1 1 1

Upvotes: 1

Related Questions