shiny
shiny

Reputation: 3502

dplyr: summarize data.frame to get the highest and lowest values

I have a data.frame that I want to summarise to get the highest (5 values) and the lowest (5 values) for each column. I used iris for a reproducible example.

The highest 5 values for all variables in iris can be obtained using

df_h <-  iris %>% 
  dplyr::select(Species, everything()) %>% 
  tidyr::gather("id", "value", 2:5) %>% 
  dplyr::arrange(Species, id, desc(value)) %>% 
  dplyr::group_by(Species, id ) %>% 
  top_n(n = 5) %>% 
  dplyr::mutate(category = "high")

for the lowest 5 values, I used the same except top_n(n = -5).

df_l <-  iris %>% 
  dplyr::select(Species, everything()) %>% 
  tidyr::gather("id", "value", 2:5) %>% 
  dplyr::arrange(Species, id, desc(value)) %>% 
  dplyr::group_by(Species, id ) %>% 
  top_n(n = -5) %>% 
  dplyr::mutate(category = "low")

Then, I joined the two data.frames together df_h (the highest 5 values) and df_l (the lowest 5 values).

df_fin <-  df_h %>% bind_rows(., df_l)

I'm looking for an efficient/shorter way to get the same result without having to create two data.frames and join them. Any suggestions will be appreciated.

Upvotes: 3

Views: 3431

Answers (2)

tegancp
tegancp

Reputation: 1202

If you want to just extract the extreme values, you can combine the two applications of top_n with a compound condition in filter (note that top_n is just a shortcut to filter using min_rank):

    library(tidyverse)

    iris %>% 
          gather("dims", "value", -Species) %>%
          group_by(Species, dims) %>%
          filter( min_rank(desc(value)) <= 5 | 
                    min_rank(value) <= 5 ) -> df_hi_lo

However, this won't include the high/low categorization.

A more flexible solution is to use a function that returns either one of these category names or an empty string:

hilo <- function(x, n) {
  hi_rk <- min_rank(desc(x))  # change rank function as needed
  lo_rk <- min_rank(x)
  paste0(ifelse(hi_rk <= n, "high", ""),
                ifelse(lo_rk <= n, "low",""))

I used the min_rank function here, which duplicates the behavior of top_n, but you should also consider replacing it with dense_rank.

This allows you to add the category for all rows, then filter to just the high/low values:

iris %>% gather("dims", "value", -Species) %>%
  group_by(Species, dims) %>%
  mutate(category = hilo(value, 5) ) %>%
  filter(category != "") -> df_hl

Upvotes: 4

Adam Bethke
Adam Bethke

Reputation: 1038

If I'm understanding your question correctly, I think you could use a ranking function to do this task programmatically and in a single data frame.

I pulled together an example using the iris dataset (below). Basically, it does three things beyond your initial starting spot:

  1. Creates a temporary variable called rank which calculates and determines the value's position in its column of values. I used dense_rank from dplyr as a proof of concept, but you may want a different ranking function depending on your purpose.

  2. Calculates the high_category_lower_bound, which is the 5th highest ranked value for the group. It also sets the low_category_upper_bound, which is nice for programmatically controlling it, but in this example is hard-coded as 5.

  3. Filters the dataframe to include only values less than or equal to the low_category_upper_bound or greater than or equal to high_category_lower_bound, and then does an ifelse lookup to create the category.

Notes:

  • I also forced distinct on each column/value pair. I did this to make the solution simpler. If you want the non-distinct values, you'll potentially need to adapt the ranking function. The dense rank gives them the same rank, so they'd be returned multiple times without making it distinct first.

  • I replaced 2:5 with 1:ncol(.), but your mileage may vary depending on the exact datasets you're actually using.

  • If you cared about the reason(s) why a record was included in the dataset, you could retain some of the temporary columns I creted and then removed from the final results.

  • Depending on the scale of the dataset(s) you're actually using, this solution may be more or less effective than you'd like. The downside of this solution is that it has to rank all of the values, which could be potentially expensive on a large-scale dataset. I'm personally a fan of using the solution and seeing if it'll scale to your needs, but it's something to be aware of. On the iris dataset it returns really quickly. On a couple million rows, it may take a bit longer.


library(dplyr)
library(tidyr)

df_all <- 
    iris %>%
    # gather all columns
    gather("column", "value", 1:ncol(.)) %>% 
    # filter to only values which can be evaluated as high/low;
    # you could expand this to include factor variables, but 
    # that's beyond the scope of this question and you'd have to
    # redefine the factor levels before this step
    filter(!is.na(as.numeric(value))) %>%
    # get distinct values - optional but probably helpful
    distinct(column, value) %>%
    # group by and perform set operations on 
    group_by(column) %>%
    # create ranking sequence
    mutate(
        rank = dense_rank(value),
        low_category_upper_bound = 5,
        high_rank = max(rank),
        high_category_lower_bound = high_rank - 4 
    ) %>%
    # retain only top and bottom values
    # filter and create category label
    filter(
        rank <= low_category_upper_bound | 
        rank >= high_category_lower_bound
    ) %>%
    mutate(
        category = ifelse(rank >= high_category_lower_bound, "high", ""),
        category = ifelse(rank <= low_category_upper_bound, "low", category)
    ) %>%
    # select columns of interest
    select(column, value, category)

Upvotes: 2

Related Questions