Reputation: 177
I am trying to find the minimum value among different columns and group. A small sample of my data looks something like this:
group cut group_score_1 group_score_2
1 a 1 3 5.0
2 b 2 2 4.0
3 a 0 2 2.5
4 b 3 5 4.0
5 a 2 3 6.0
6 b 1 5 1.0
I want to group by the groups and for each group, find the row which contains the minimum group score among both group scores and then also get the name of the column which contains the minimum (group_score_1 or group_score_2), so basically my result should be something like this:
group cut group_score_1 group_score_2
1 a 0 2 2.5
2 b 1 5 1.0
I tried a few ideas, and came up eventually to dividing the into several new data frames, filtering by group and selecting the relevant columns and then using which.min()
, but I'm sure there's a much more efficient way to do it. Not sure what I am missing.
Upvotes: 2
Views: 1052
Reputation: 887118
We can use data.table
methods
library(data.table)
setDT(df)[df[, .I[which.min(do.call(pmin, .SD))],
group, .SDcols = patterns('^group_score')]$V1]
# group cut group_score_1 group_score_2
#1: a 0 2 2.5
#2: b 1 5 1.0
Upvotes: 1
Reputation: 101343
Here is a base R option using pmin
+ ave
+ subset
subset(
df,
as.logical(ave(
do.call(pmin, df[grep("group_score_\\d+", names(df))]),
group,
FUN = function(x) x == min(x)
))
)
which gives
group cut group_score_1 group_score_2
3 a 0 2 2.5
6 b 1 5 1.0
Data
> dput(df)
structure(list(group = c("a", "b", "a", "b", "a", "b"), cut = c(1L,
2L, 0L, 3L, 2L, 1L), group_score_1 = c(3L, 2L, 2L, 5L, 3L, 5L
), group_score_2 = c(5, 4, 2.5, 4, 6, 1)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Upvotes: 0
Reputation: 388982
For each group
, you can calculate min
value and select the row in which that value exist in one of the column.
library(dplyr)
df %>%
group_by(group) %>%
filter({tmp = min(group_score_1, group_score_2);
group_score_1 == tmp | group_score_2 == tmp})
# group cut group_score_1 group_score_2
# <chr> <int> <int> <dbl>
#1 a 0 2 2.5
#2 b 1 5 1
The above works well when you have only two group_score
columns. If you have many such columns it is not possible to list down each one of them with group_score_1 == tmp | group_score_2 == tmp
etc. In such case, get the data in long format and get the corresponding cut
value of the minimum value and join the data. Assuming cut
is unique in each group.
df %>%
tidyr::pivot_longer(cols = starts_with('group_score')) %>%
group_by(group) %>%
summarise(cut = cut[which.min(value)]) %>%
left_join(df, by = c("group", "cut"))
Upvotes: 1