Reputation: 1365

For each big group select the best sub-group available in R

I have a data set that has some big groups and subgroups (small groups).

I want to select small group 1 for each big group. If small group 1 doesn't exist in a big group, select small group 2 instead. My example below stops there, but ideally this would keep going: if small group 2 is not found, select small group 3, and so on. In the example I use numbers, but my focus is on doing this with factor levels.

Is this possible with factors in dplyr, assuming the factor levels are ordered by importance?

Here is my example data:

library(dplyr)  # provides %>%

set.seed(123)
big_group = rep(1:3, each = 6)
small_group = c(sample(1:2, size = 6, replace = TRUE),
                rep(1, each = 6),
                rep(2, each = 6)) %>% 
  as.factor()

d = data.frame(big_group,
               small_group,
               value = runif(n = 3 * 6))

And the ideal output would be

big_group    small_group    value
1            1              0.52810549
2            1              0.67757064
3            2              0.32792072

Upvotes: 2

Answers (3)

Mikko Marttila

Reputation: 11878

Combining both answers from @akrun and @KarolisKoncevičius you could also just do:

d %>%
  group_by(big_group) %>% 
  slice(which.min(small_group))
#> # A tibble: 3 x 3
#> # Groups:   big_group [3]
#>   big_group small_group value
#>       <int> <fct>       <dbl>
#> 1         1 1           0.528
#> 2         2 1           0.678
#> 3         3 2           0.328
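Since the question is really about factor levels rather than numbers, here is a minimal sketch of the same idea with labelled levels. The data frame `d2`, its level names, and the `priority` vector are made up for illustration; the only assumption is that the levels are declared in order of importance.

```r
library(dplyr)

# Hypothetical data: the order of `priority` encodes importance ("gold" is best).
priority <- c("gold", "silver", "bronze")
d2 <- data.frame(
  big_group   = c("a", "a", "a", "b", "b", "c"),
  small_group = factor(c("silver", "gold", "bronze", "bronze", "silver", "bronze"),
                       levels = priority),
  value       = c(10, 20, 30, 40, 50, 60)
)

best <- d2 %>%
  group_by(big_group) %>%
  # as.integer() on a factor returns the level codes, so the smallest
  # code is the highest-priority level that is present in the group.
  slice(which.min(as.integer(small_group))) %>%
  ungroup()

best
```

With this data the selected rows are "gold" for group a, "silver" for group b, and "bronze" for group c. which.min() returns a single index, so ties within a group resolve to the first matching row.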

Upvotes: 2

akrun

Reputation: 887118

We group by 'big_group', filter the rows whose 'small_group' has the lowest level code, and then slice the first of those rows

d %>%
   group_by(big_group) %>%
   filter(as.numeric(small_group) == min(as.numeric(small_group))) %>% 
   slice(1)
# A tibble: 3 x 3
# Groups: big_group [3]
#   big_group small_group value
#      <int> <fctr>      <dbl>
#1         1 1           0.528
#2         2 1           0.678
#3         3 2           0.328
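A note on why as.numeric(small_group) works in the filter above: on a factor it returns the integer level codes, not the labels, so the comparison follows the order of the levels. The small sketch below uses a made-up factor `f` whose labels look like numbers but are stored in a different order, to show the difference.

```r
# Hypothetical factor: the labels look numeric, but the level order is "10" then "2".
f <- factor(c("10", "2", "10"), levels = c("10", "2"))

codes  <- as.integer(f)                # level codes: 1 2 1
labels <- as.numeric(as.character(f))  # label values: 10 2 10

codes
labels
```

So selecting by the minimum code picks "10" here, which is the first level, even though 2 is the smaller number. That is the behaviour wanted when the levels are ordered by importance.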

Or use match with slice to find the first row holding the first level present in each group

d %>% 
  group_by(big_group) %>% 
  slice(match(levels(droplevels(small_group))[1], small_group))

Upvotes: 2

Karolis Koncevičius

Reputation: 9656

Not a dplyr solution, but in base R you can do:

do.call(rbind, by(d, d$big_group, function(x) x[which.min(x$small_group),]))

#   big_group small_group     value
# 1         1           1 0.5281055
# 2         2           1 0.6775706
# 3         3           2 0.3279207
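Another base R route, offered as a sketch rather than a replacement, is to sort by big group and level code and then keep the first row of each big group with duplicated(). The data frame `d3` and its level names are invented for the example.

```r
# Hypothetical data: levels are declared in order of importance ("high" is best).
d3 <- data.frame(
  big_group   = c(1, 1, 2, 2, 3),
  small_group = factor(c("low", "high", "mid", "low", "low"),
                       levels = c("high", "mid", "low")),
  value       = c(0.1, 0.2, 0.3, 0.4, 0.5)
)

# Sort by big group, then by level code, so the best available
# sub-group comes first within each big group.
sorted <- d3[order(d3$big_group, as.integer(d3$small_group)), ]

# duplicated() flags every row after the first one per big group.
best3 <- sorted[!duplicated(sorted$big_group), ]

best3
```

Here group 1 keeps "high", group 2 keeps "mid", and group 3 keeps "low". Unlike the by() version, this keeps the original row names and avoids building one data frame per group.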

Upvotes: 2
