Top 5 and bottom 5 in r using Group_by

Question

I am looking for code or a function that labels the 5 highest and the 5 lowest values. This could, for example, be applied to a dataset similar to this:

df <- data.frame(
   Date = c(rep("2010-01-31",16), rep("2010-02-28", 14)), 
   Value=c(rep(c(1,2,3,4,5,6,7,8,9,NA,NA,NA,NA,NA,15),2))
)

Edit: This is just sample data. My real data is more complex, so the code should allow for a varying number of rows per Date and for multiple missing values (NAs).

I would like the five lowest values labelled "5w" and the five highest labelled "5b". The labelling should be wrapped in a group_by on the date so that the process is repeated for each period. I have tried using percentiles, but that method does not keep the number of values in each bracket constant, so I am looking for a method that does.

If possible, it would also be nice to put all firms into 5% brackets, i.e. 20 brackets with all firms distributed among them. The best bracket would then consist of the 5% of firms with the highest values. The bracket values could be 0:19, so a firm in the highest bracket would receive 19 and a firm in the lowest bracket would receive 0.

Thanks in advance.

r2evans · Accepted Answer

Heads up: while I suspect that this is just sample data, you have two 1s in 2010-01-31. This code accounts for that, but the output looks odd when unsorted, so I'm adding arrange to make the result easier to read.

I use min_rank here, assuming that you do not want ties to add extra rows and always want the top/bottom 5. An alternative is dense_rank, which would label the six lowest rows in 2010-01-31 as "5w" because of the tie at 1.

library(dplyr)
dat %>%
  group_by(Date) %>%
  mutate(
    R = min_rank(Value),
    Quux = case_when(
      R < 6       ~ "5w",
      R > n() - 5 ~ "5b",
      TRUE        ~ NA_character_)
    ) %>%
  ungroup() %>%
  arrange(Date, Value) %>%
  print(n=99)
# # A tibble: 30 x 4
#    Date       Value     R Quux 
#            
#  1 2010-01-31     1     1 5w   
#  2 2010-01-31     1     1 5w   
#  3 2010-01-31     2     3 5w   
#  4 2010-01-31     3     4 5w   
#  5 2010-01-31     4     5 5w   
#  6 2010-01-31     5     6 <NA> 
#  7 2010-01-31     6     7 <NA> 
#  8 2010-01-31     7     8 <NA> 
#  9 2010-01-31     8     9 <NA> 
# 10 2010-01-31     9    10 <NA> 
# 11 2010-01-31    10    11 <NA> 
# 12 2010-01-31    11    12 5b   
# 13 2010-01-31    12    13 5b   
# 14 2010-01-31    13    14 5b   
# 15 2010-01-31    14    15 5b   
# 16 2010-01-31    15    16 5b   
# 17 2010-02-28     2     1 5w   
# 18 2010-02-28     3     2 5w   
# 19 2010-02-28     4     3 5w   
# 20 2010-02-28     5     4 5w   
# 21 2010-02-28     6     5 5w   
# 22 2010-02-28     7     6 <NA> 
# 23 2010-02-28     8     7 <NA> 
# 24 2010-02-28     9     8 <NA> 
# 25 2010-02-28    10     9 <NA> 
# 26 2010-02-28    11    10 5b   
# 27 2010-02-28    12    11 5b   
# 28 2010-02-28    13    12 5b   
# 29 2010-02-28    14    13 5b   
# 30 2010-02-28    15    14 5b
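
The tie handling mentioned earlier can be checked in isolation. This is a minimal sketch, assuming dplyr is installed; the vector x is made up for illustration and mirrors the two 1s in 2010-01-31:

```r
library(dplyr)

# Hypothetical values with a tie at the bottom, like the two 1s in 2010-01-31.
x <- c(1, 1, 2, 3, 4, 5)

# min_rank leaves a gap after a tie, so only five values rank below 6.
min_rank(x)
#> [1] 1 1 3 4 5 6

# dense_rank leaves no gap, so all six values rank below 6.
dense_rank(x)
#> [1] 1 1 2 3 4 5

sum(min_rank(x) < 6)    # 5
sum(dense_rank(x) < 6)  # 6
```

Note that a tie sitting exactly at the cut-off (for example two values sharing rank 5) would still pull an extra row into the label with either function.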

Edit, using the updated sample data: I'm inferring that the NA values should be ignored and that only the ranked rows should be considered. The output also shows what happens when a group has fewer than 10 non-NA rows: 2010-02-28 gets only four "5b" labels.
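
As a small illustration of why the upper cut-off has to change from n() to max(R, na.rm = TRUE): min_rank returns NA for missing values, so the highest rank assigned can be smaller than the group size. The vector v is made up for this sketch:

```r
library(dplyr)

# Hypothetical group of five rows, two of them missing.
v <- c(7, NA, 3, 15, NA)

# Missing values are not ranked; they stay NA.
r <- min_rank(v)
r
#> [1]  2 NA  1  3 NA

length(v)             # 5, what n() would report for this group
max(r, na.rm = TRUE)  # 3, the highest rank actually assigned
```

The pipeline below applies the same comparison within each Date group.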

dat %>%
  group_by(Date) %>%
  mutate(
    R = min_rank(Value),
    Quux = case_when(
      R < 6                        ~ "5w",
      R > max(R, na.rm = TRUE) - 5 ~ "5b",
      TRUE                         ~ NA_character_)
    ) %>%
  ungroup() %>%
  arrange(Date, Value) %>%
  print(n=99)

# # A tibble: 30 x 4
#    Date       Value     R Quux 
#            
#  1 2010-01-31     1     1 5w   
#  2 2010-01-31     1     1 5w   
#  3 2010-01-31     2     3 5w   
#  4 2010-01-31     3     4 5w   
#  5 2010-01-31     4     5 5w   
#  6 2010-01-31     5     6 <NA> 
#  7 2010-01-31     6     7 5b   
#  8 2010-01-31     7     8 5b   
#  9 2010-01-31     8     9 5b   
# 10 2010-01-31     9    10 5b   
# 11 2010-01-31    15    11 5b   
# 12 2010-01-31    NA    NA <NA> 
# 13 2010-01-31    NA    NA <NA> 
# 14 2010-01-31    NA    NA <NA> 
# 15 2010-01-31    NA    NA <NA> 
# 16 2010-01-31    NA    NA <NA> 
# 17 2010-02-28     2     1 5w   
# 18 2010-02-28     3     2 5w   
# 19 2010-02-28     4     3 5w   
# 20 2010-02-28     5     4 5w   
# 21 2010-02-28     6     5 5w   
# 22 2010-02-28     7     6 5b   
# 23 2010-02-28     8     7 5b   
# 24 2010-02-28     9     8 5b   
# 25 2010-02-28    15     9 5b   
# 26 2010-02-28    NA    NA <NA> 
# 27 2010-02-28    NA    NA <NA> 
# 28 2010-02-28    NA    NA <NA> 
# 29 2010-02-28    NA    NA <NA> 
# 30 2010-02-28    NA    NA <NA> 
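
The question also asks for 20 brackets numbered 0:19, which the code above does not cover. One possible sketch uses dplyr's ntile, which splits the non-missing values of each group into buckets whose sizes differ by at most one. The data frame firms and the column name Bracket are made up for illustration, with 40 rows per date so that the first period puts exactly two firms in each bracket:

```r
library(dplyr)

# Hypothetical data: 40 firms per date, one missing value in the second period.
firms <- data.frame(
  Date  = rep(c("2010-01-31", "2010-02-28"), each = 40),
  Value = c(1:40, 40:2, NA)
)

bracketed <- firms %>%
  group_by(Date) %>%
  # ntile() numbers buckets from 1, so subtract 1 to get 0:19; NA stays NA.
  mutate(Bracket = ntile(Value, 20) - 1L) %>%
  ungroup()

# Count firms per bracket in the first period.
bracketed %>%
  filter(Date == "2010-01-31") %>%
  count(Bracket)
```

With fewer than 20 non-missing values in a group, as in the sample data from the question, some brackets stay empty, so this approach only gives true 5% brackets when each period has enough firms.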
