Reputation: 3502
I have a data.frame that I want to summarise to get the highest (5 values) and the lowest (5 values) for each column. I used iris
for a reproducible example.
The highest 5 values for all variables in iris
can be obtained using
df_h <- iris %>%
dplyr::select(Species, everything()) %>%
tidyr::gather("id", "value", 2:5) %>%
dplyr::arrange(Species, id, desc(value)) %>%
dplyr::group_by(Species, id ) %>%
top_n(n = 5) %>%
dplyr::mutate(category = "high")
for the lowest 5 values, I used the same except top_n(n = -5)
.
df_l <- iris %>%
dplyr::select(Species, everything()) %>%
tidyr::gather("id", "value", 2:5) %>%
dplyr::arrange(Species, id, desc(value)) %>%
dplyr::group_by(Species, id ) %>%
top_n(n = -5) %>%
dplyr::mutate(category = "low")
Then, I joined the two data.frames together df_h (the highest 5 values) and df_l (the lowest 5 values).
df_fin <- df_h %>% bind_rows(., df_l)
I'm looking for an efficient/shorter way to get the same result without having to create two data.frames and join them. Any suggestions will be appreciated.
Upvotes: 3
Views: 3431
Reputation: 1202
If you want to just extract the extreme values, you can combine the two applications of top_n
with a compound condition in filter
(note that top_n
is just a shortcut to filter
using min_rank
):
library(tidyverse)
iris %>%
gather("dims", "value", -Species) %>%
group_by(Species, dims) %>%
filter( min_rank(desc(value)) <= 5 |
min_rank(value) <= 5 ) -> df_hi_lo
However, this won't include the high/low categorization.
A more flexible solution is to use a function that returns either one of these category names or an empty string:
hilo <- function(x, n) {
hi_rk <- min_rank(desc(x)) # change rank function as needed
lo_rk <- min_rank(x)
paste0(ifelse(hi_rk <= n, "high", ""),
ifelse(lo_rk <= n, "low",""))
I used the min_rank
function here, which duplicates the behavior of top_n
, but you should also consider replacing it with dense_rank
.
This allows you to add the category for all rows, then filter to just the high/low values:
iris %>% gather("dims", "value", -Species) %>%
group_by(Species, dims) %>%
mutate(category = hilo(value, 5) ) %>%
filter(category != "") -> df_hl
Upvotes: 4
Reputation: 1038
If I'm understanding your question correctly, I think you could use a ranking function to do this task programmatically and in a single data frame.
I pulled together an example using the iris
dataset (below). Basically, it does three things beyond your initial starting spot:
Creates a temporary variable called rank
which calculates and determines the value's position in its column of values. I used dense_rank
from dplyr as a proof of concept, but you may want a different ranking function depending on your purpose.
Calculates the high_category_lower_bound
, which is the 5th highest ranked value for the group. It also sets the low_category_upper_bound
, which is nice for programmatically controlling it, but in this example is hard-coded as 5.
Filters the dataframe to include only values less than or equal to the low_category_upper_bound
or greater than or equal to high_category_lower_bound
, and then does an ifelse
lookup to create the category.
Notes:
I also forced distinct
on each column/value pair. I did this to make the solution simpler. If you want the non-distinct values, you'll potentially need to adapt the ranking function. The dense rank gives them the same rank, so they'd be returned multiple times without making it distinct first.
I replaced 2:5
with 1:ncol(.)
, but your mileage may vary depending on the exact datasets you're actually using.
If you cared about the reason(s) why a record was included in the dataset, you could retain some of the temporary columns I creted and then removed from the final results.
Depending on the scale of the dataset(s) you're actually using, this solution may be more or less effective than you'd like. The downside of this solution is that it has to rank all of the values, which could be potentially expensive on a large-scale dataset. I'm personally a fan of using the solution and seeing if it'll scale to your needs, but it's something to be aware of. On the iris
dataset it returns really quickly. On a couple million rows, it may take a bit longer.
library(dplyr)
library(tidyr)
df_all <-
iris %>%
# gather all columns
gather("column", "value", 1:ncol(.)) %>%
# filter to only values which can be evaluated as high/low;
# you could expand this to include factor variables, but
# that's beyond the scope of this question and you'd have to
# redefine the factor levels before this step
filter(!is.na(as.numeric(value))) %>%
# get distinct values - optional but probably helpful
distinct(column, value) %>%
# group by and perform set operations on
group_by(column) %>%
# create ranking sequence
mutate(
rank = dense_rank(value),
low_category_upper_bound = 5,
high_rank = max(rank),
high_category_lower_bound = high_rank - 4
) %>%
# retain only top and bottom values
# filter and create category label
filter(
rank <= low_category_upper_bound |
rank >= high_category_lower_bound
) %>%
mutate(
category = ifelse(rank >= high_category_lower_bound, "high", ""),
category = ifelse(rank <= low_category_upper_bound, "low", category)
) %>%
# select columns of interest
select(column, value, category)
Upvotes: 2