Amidavid

Reputation: 177

Finding minimum by groups and among columns

I am trying to find the minimum value across several columns within each group. A small sample of my data looks like this:

     group cut group_score_1 group_score_2
1        a   1             3           5.0
2        b   2             2           4.0
3        a   0             2           2.5
4        b   3             5           4.0
5        a   2             3           6.0
6        b   1             5           1.0

I want to group by group and, for each group, find the row that contains the minimum score across both score columns. I would also like to get the name of the column that holds that minimum (group_score_1 or group_score_2). My result should look something like this:

     group cut group_score_1 group_score_2
1        a   0             2           2.5
2        b   1             5           1.0

I tried a few ideas and eventually ended up splitting the data into several new data frames, filtering by group, selecting the relevant columns, and then using which.min(). I'm sure there is a much more efficient way to do this, but I'm not sure what I am missing.

Upvotes: 2

Views: 1052

Answers (3)

akrun

Reputation: 887118

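We can use data.table: take the row-wise minimum of the score columns with pmin, find the row index (.I) of the smallest value within each group, and subset on those indices.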

library(data.table)
setDT(df)[df[, .I[which.min(do.call(pmin, .SD))], 
     group, .SDcols = patterns('^group_score')]$V1]
#   group cut group_score_1 group_score_2
#1:     a   0             2           2.5
#2:     b   1             5           1.0

Upvotes: 1

ThomasIsCoding

Reputation: 101343

Here is a base R option using pmin + ave + subset. Note that it keeps every row tied for the group minimum.

subset(
  df,
  as.logical(ave(
    do.call(pmin, df[grep("group_score_\\d+", names(df))]),
    group,
    FUN = function(x) x == min(x)
  ))
)

which gives

  group cut group_score_1 group_score_2
3     a   0             2           2.5
6     b   1             5           1.0

Data

> dput(df)
structure(list(group = c("a", "b", "a", "b", "a", "b"), cut = c(1L, 
2L, 0L, 3L, 2L, 1L), group_score_1 = c(3L, 2L, 2L, 5L, 3L, 5L
), group_score_2 = c(5, 4, 2.5, 4, 6, 1)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))
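
The question also asks for the name of the column that holds the minimum, which the output above does not include. A minimal base R sketch of one way to add it is shown here; the helper columns min_score and min_col are names introduced for illustration only, and ties are resolved by taking the first match.

```r
# Sample data from the question
df <- data.frame(
  group = c("a", "b", "a", "b", "a", "b"),
  cut = c(1L, 2L, 0L, 3L, 2L, 1L),
  group_score_1 = c(3L, 2L, 2L, 5L, 3L, 5L),
  group_score_2 = c(5, 4, 2.5, 4, 6, 1)
)

# Names of the score columns
score_cols <- grep("^group_score_\\d+$", names(df), value = TRUE)

# Row-wise minimum across the score columns (hypothetical helper column)
df$min_score <- do.call(pmin, df[score_cols])

# Name of the column holding that row-wise minimum (first column on ties)
df$min_col <- score_cols[apply(df[score_cols], 1, which.min)]

# For each group, index of the first row attaining the group minimum
idx <- tapply(
  seq_len(nrow(df)), df$group,
  function(i) i[which.min(df$min_score[i])]
)

result <- df[unlist(idx), ]
result
```

Unlike the ave version, this keeps only one row per group when several rows tie for the minimum.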

Upvotes: 0

Ronak Shah

Reputation: 388982

For each group, you can calculate the minimum value and keep the rows in which that value appears in one of the score columns.

library(dplyr)

df %>%
  group_by(group) %>%
  filter({tmp = min(group_score_1, group_score_2);
          group_score_1 == tmp | group_score_2 == tmp})

#  group   cut group_score_1 group_score_2
#  <chr> <int>         <int>         <dbl>
#1 a         0             2           2.5
#2 b         1             5           1  

The above works well when you have only two group_score columns. With many such columns, it is impractical to spell out a comparison for each one (group_score_1 == tmp | group_score_2 == tmp | ...). In that case, reshape the data to long format, take the cut value that corresponds to the minimum score in each group, and join back to the original data. This assumes cut is unique within each group.

df %>%
  tidyr::pivot_longer(cols = starts_with('group_score')) %>%
  group_by(group) %>%
  summarise(cut = cut[which.min(value)]) %>%
  left_join(df, by = c("group", "cut"))
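
Since the long format already carries the column name for each score, the same approach can be extended to return the name of the column that holds the minimum, as the question asks. This is only a sketch under the same assumption that cut is unique within each group; min_col and min_score are names chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  group = c("a", "b", "a", "b", "a", "b"),
  cut = c(1L, 2L, 0L, 3L, 2L, 1L),
  group_score_1 = c(3L, 2L, 2L, 5L, 3L, 5L),
  group_score_2 = c(5, 4, 2.5, 4, 6, 1)
)

result <- df %>%
  # One row per (original row, score column); defaults give `name` and `value`
  pivot_longer(cols = starts_with("group_score")) %>%
  group_by(group) %>%
  summarise(
    cut = cut[which.min(value)],       # cut of the row holding the minimum
    min_col = name[which.min(value)],  # column that holds the minimum
    min_score = min(value)             # the minimum itself
  ) %>%
  # Bring back the original score columns for the selected rows
  left_join(df, by = c("group", "cut"))

result
```

which.min returns the first match, so a tie within a group yields a single row rather than all tied rows.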

Upvotes: 1
