dplyr: summarize data.frame to get the highest and lowest values

Question

I have a data.frame that I want to summarise to get the highest (5 values) and the lowest (5 values) for each column. I used iris for a reproducible example.

The highest 5 values for all variables in iris can be obtained using

df_h <-  iris %>% 
  dplyr::select(Species, everything()) %>% 
  tidyr::gather("id", "value", 2:5) %>% 
  dplyr::arrange(Species, id, desc(value)) %>% 
  dplyr::group_by(Species, id ) %>% 
  top_n(n = 5) %>% 
  dplyr::mutate(category = "high")

For the lowest 5 values, I used the same code except with top_n(n = -5).

df_l <-  iris %>% 
  dplyr::select(Species, everything()) %>% 
  tidyr::gather("id", "value", 2:5) %>% 
  dplyr::arrange(Species, id, desc(value)) %>% 
  dplyr::group_by(Species, id ) %>% 
  top_n(n = -5) %>% 
  dplyr::mutate(category = "low")

Then I combined the two data.frames, df_h (the highest 5 values) and df_l (the lowest 5 values), with bind_rows.

df_fin <-  df_h %>% bind_rows(., df_l)

I'm looking for an efficient/shorter way to get the same result without having to create two data.frames and join them. Any suggestions will be appreciated.

tegancp · Accepted Answer

If you want to just extract the extreme values, you can combine the two applications of top_n with a compound condition in filter (note that top_n is just a shortcut to filter using min_rank):

library(tidyverse)

iris %>%
  gather("dims", "value", -Species) %>%
  group_by(Species, dims) %>%
  filter(min_rank(desc(value)) <= 5 |
           min_rank(value) <= 5) -> df_hi_lo

However, this won't include the high/low categorization.

A more flexible solution is to use a function that returns either one of these category names or an empty string:

hilo <- function(x, n) {
  hi_rk <- min_rank(desc(x))  # change rank function as needed
  lo_rk <- min_rank(x)
  # "high", "low", or "" for values in neither tail
  paste0(ifelse(hi_rk <= n, "high", ""),
         ifelse(lo_rk <= n, "low", ""))
}

I used the min_rank function here, which duplicates the behavior of top_n, but you should also consider replacing it with dense_rank.
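
The difference between the two rank functions only shows up when there are ties. Here is a small sketch on a made-up vector (the vector `x` is purely illustrative and is not taken from iris): `min_rank` leaves a gap after tied values, while `dense_rank` does not, so the same cutoff can keep a different number of elements.

```r
library(dplyr)

# Hypothetical vector with a tie, for illustration only
x <- c(10, 10, 20, 30, 40)

min_rank(x)    # 1 1 3 4 5  (gap after the tie)
dense_rank(x)  # 1 1 2 3 4  (no gaps)

# With a cutoff of 2, the two functions select different elements
x[min_rank(x) <= 2]    # 10 10
x[dense_rank(x) <= 2]  # 10 10 20
```

So with `dense_rank` the cutoff counts distinct values rather than rows, which may or may not be what you want for "top 5".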

This allows you to add the category for all rows, then filter to just the high/low values:

iris %>% gather("dims", "value", -Species) %>%
  group_by(Species, dims) %>%
  mutate(category = hilo(value, 5) ) %>%
  filter(category != "") -> df_hl
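
In dplyr 1.0.0 and later, top_n is superseded, as is gather in tidyr. As a rough sketch of the same idea with the newer verbs, assuming tidyr >= 1.0.0 for pivot_longer, the categorization can also be written inline with case_when. The name `df_hl2` is mine, not from the question. One difference to be aware of: a value that ranks in both tails gets only "high" here, because case_when stops at the first matching condition.

```r
library(dplyr)
library(tidyr)

df_hl2 <- iris %>%
  # long format: one row per Species / measurement / value
  pivot_longer(-Species, names_to = "dims", values_to = "value") %>%
  group_by(Species, dims) %>%
  # ranks are computed within each Species/dims group;
  # rows matching neither condition get NA
  mutate(category = case_when(
    min_rank(desc(value)) <= 5 ~ "high",
    min_rank(value) <= 5       ~ "low"
  )) %>%
  filter(!is.na(category)) %>%
  ungroup()
```

Because min_rank keeps ties at the cutoff, some groups can end up with more than 5 rows per category, just as with top_n.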
