Reputation: 119
I am looking for code or a function that assigns a label to the 5 highest and the 5 lowest values. This could, for example, be applied to a dataset similar to this:
df <- data.frame(
  Date = c(rep("2010-01-31", 16), rep("2010-02-28", 14)),
  Value = c(rep(c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA, NA, NA, NA, NA, 15), 2))
)
Edit: This is just sample data. The data I use is more complex, so the code should allow for varying lengths of the column Date and also for multiple missing values (NAs).
I would like the five lowest values labeled "5w" and the five highest values labeled "5b". The process should then be wrapped in a group_by based on the date so that it is repeated for each period. I have tried using percentiles, but that method does not maintain a constant number of values in each bracket, so I am looking for a method that does.
If possible, it would also be nice to put all firms into 5% brackets. By this, I mean 20 brackets with all firms distributed across them, so that the best bracket consists of the 5% of firms with the highest values. The labels could be 0:19, i.e. a firm in the highest bracket would receive 19 and a firm in the lowest bracket would receive 0.
Thanks in advance.
Upvotes: 0
Views: 608
Reputation: 160437
Heads up: while I suspect that this is just sample data, you have two 1s in 2010-01-31. This code accounts for that, but when unsorted the output looks odd. For that, I'm adding arrange() to show them.
I use min_rank() here, assuming that you do not want ties and always want top/bottom 5. An alternative is dense_rank(), which would label the top six from 2010-01-31 due to the tie for 1.
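To make the tie behaviour concrete, here is a minimal sketch on a stand-alone vector. The vector x is made up for illustration; it mirrors the six lowest values of 2010-01-31.

```r
library(dplyr)

# Hypothetical vector with a tie for the lowest value.
x <- c(1, 1, 2, 3, 4, 5)

mr <- min_rank(x)    # ties share the lowest rank and leave a gap: 1 1 3 4 5 6
dr <- dense_rank(x)  # ties share a rank and leave no gap:        1 1 2 3 4 5

n_min   <- sum(mr < 6)  # 5 values pass the "R < 6" test
n_dense <- sum(dr < 6)  # 6 values pass the same test
```

With min_rank() the "R < 6" test therefore never selects more than five rows unless the tie sits exactly at the cut-off.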
library(dplyr)
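# dat: the question's sample data (named df there), as originally posted
dat %>%
  group_by(Date) %>%
  mutate(
    R = min_rank(Value),      # rank within each date; ties share the lowest rank
    Quux = case_when(
      R < 6 ~ "5w",           # five lowest values
      R > n() - 5 ~ "5b",     # five highest values
      TRUE ~ NA_character_)
  ) %>%
  ungroup() %>%
  arrange(Date, Value) %>%
  print(n=99)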
# # A tibble: 30 x 4
# Date Value R Quux
# <fct> <int> <int> <chr>
# 1 2010-01-31 1 1 5w
# 2 2010-01-31 1 1 5w
# 3 2010-01-31 2 3 5w
# 4 2010-01-31 3 4 5w
# 5 2010-01-31 4 5 5w
# 6 2010-01-31 5 6 <NA>
# 7 2010-01-31 6 7 <NA>
# 8 2010-01-31 7 8 <NA>
# 9 2010-01-31 8 9 <NA>
# 10 2010-01-31 9 10 <NA>
# 11 2010-01-31 10 11 <NA>
# 12 2010-01-31 11 12 5b
# 13 2010-01-31 12 13 5b
# 14 2010-01-31 13 14 5b
# 15 2010-01-31 14 15 5b
# 16 2010-01-31 15 16 5b
# 17 2010-02-28 2 1 5w
# 18 2010-02-28 3 2 5w
# 19 2010-02-28 4 3 5w
# 20 2010-02-28 5 4 5w
# 21 2010-02-28 6 5 5w
# 22 2010-02-28 7 6 <NA>
# 23 2010-02-28 8 7 <NA>
# 24 2010-02-28 9 8 <NA>
# 25 2010-02-28 10 9 <NA>
# 26 2010-02-28 11 10 5b
# 27 2010-02-28 12 11 5b
# 28 2010-02-28 13 12 5b
# 29 2010-02-28 14 13 5b
# 30 2010-02-28 15 14 5b
Edit using newly-discovered data. I'm inferring that the NA values should be ignored, and only the ranked ones should be considered. This shows a condition where there are not 10 total valued rows, as 2010-02-28 only provides 4 "5b" rows.
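Why the group size is no longer a safe upper bound once NAs appear can be seen on a small made-up vector (x is hypothetical): min_rank() leaves NA values unranked, so the number of rows overstates the highest rank.

```r
library(dplyr)

# Hypothetical group of five rows, two of them missing.
x <- c(3, NA, 1, NA, 2)

r <- min_rank(x)              # 2 NA 1 NA 3: NA values get no rank
size <- length(x)             # 5, what n() would report for this group
top <- max(r, na.rm = TRUE)   # 3, the highest rank actually assigned
```

Comparing against size - 5 would miss the top rows here, whereas top - 5 counts back from the highest rank that exists.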
dat %>%
  group_by(Date) %>%
  mutate(
    R = min_rank(Value),                   # NA values stay unranked
    Quux = case_when(
      R < 6 ~ "5w",
      R > max(R, na.rm = TRUE) - 5 ~ "5b", # count back from the highest assigned rank, not n()
      TRUE ~ NA_character_)
  ) %>%
  ungroup() %>%
  arrange(Date, Value) %>%
  print(n=99)
# # A tibble: 30 x 4
# Date Value R Quux
# <fct> <dbl> <int> <chr>
# 1 2010-01-31 1 1 5w
# 2 2010-01-31 1 1 5w
# 3 2010-01-31 2 3 5w
# 4 2010-01-31 3 4 5w
# 5 2010-01-31 4 5 5w
# 6 2010-01-31 5 6 <NA>
# 7 2010-01-31 6 7 5b
# 8 2010-01-31 7 8 5b
# 9 2010-01-31 8 9 5b
# 10 2010-01-31 9 10 5b
# 11 2010-01-31 15 11 5b
# 12 2010-01-31 NA NA <NA>
# 13 2010-01-31 NA NA <NA>
# 14 2010-01-31 NA NA <NA>
# 15 2010-01-31 NA NA <NA>
# 16 2010-01-31 NA NA <NA>
# 17 2010-02-28 2 1 5w
# 18 2010-02-28 3 2 5w
# 19 2010-02-28 4 3 5w
# 20 2010-02-28 5 4 5w
# 21 2010-02-28 6 5 5w
# 22 2010-02-28 7 6 5b
# 23 2010-02-28 8 7 5b
# 24 2010-02-28 9 8 5b
# 25 2010-02-28 15 9 5b
# 26 2010-02-28 NA NA <NA>
# 27 2010-02-28 NA NA <NA>
# 28 2010-02-28 NA NA <NA>
# 29 2010-02-28 NA NA <NA>
# 30 2010-02-28 NA NA <NA>
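For the second part of the question (20 brackets labeled 0:19), the same ranking idea can be stretched into a rough sketch. Everything here is an assumption for illustration: the firms data frame is made up (40 valued firms plus 3 NAs per date), and row_number() is used instead of min_rank() so that ties are broken by row order and the brackets stay equally sized.

```r
library(dplyr)

# Hypothetical data: 40 valued firms and 3 missing values per date.
set.seed(1)
firms <- data.frame(
  Date  = rep(c("2010-01-31", "2010-02-28"), each = 43),
  Value = c(sample(c(1:40, NA, NA, NA)), sample(c(101:140, NA, NA, NA)))
)

bracketed <- firms %>%
  group_by(Date) %>%
  mutate(
    R = row_number(Value),            # rank 1..k by Value; NA stays NA
    k = sum(!is.na(Value)),           # number of valued firms in this date
    Bracket = ((R - 1L) * 20L) %/% k  # integer bracket 0..19, NA for missing values
  ) %>%
  ungroup()
```

With 40 valued firms per date each bracket holds exactly two firms. When a date has fewer than 20 valued firms, some bracket numbers simply go unused, so the top firm may not reach 19.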
Upvotes: 1
Reputation: 1618
Try this
library(dplyr)
DF <- data.frame(
  Date = c(rep("2010-01-31", 16), rep("2010-02-28", 14)),
  Value = c(rep(1:15, 2))
)
DF %>%
  group_by(Date) %>%
  mutate(n = row_number(),
         nn = 1 + max(n) - n) %>%
  ungroup() %>%
  filter(n <= 5 | nn <= 5) %>%
  mutate(v1 = if_else(n <= 5, "5b", "5w")) %>%
  select(-n, -nn)
Output
# A tibble: 20 x 3
Date Value v1
<fct> <int> <chr>
1 2010-01-31 1 5b
2 2010-01-31 2 5b
3 2010-01-31 3 5b
4 2010-01-31 4 5b
5 2010-01-31 5 5b
6 2010-01-31 12 5w
7 2010-01-31 13 5w
8 2010-01-31 14 5w
9 2010-01-31 15 5w
10 2010-01-31 1 5w
11 2010-02-28 2 5b
12 2010-02-28 3 5b
13 2010-02-28 4 5b
14 2010-02-28 5 5b
15 2010-02-28 6 5b
16 2010-02-28 11 5w
17 2010-02-28 12 5w
18 2010-02-28 13 5w
19 2010-02-28 14 5w
20 2010-02-28 15 5w
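Note that row_number() with no argument numbers rows in their existing order, so the snippet above only finds the extremes when each date is already sorted by Value (row 10 of the output is a 1 that lands among the last five), and its labels are the reverse of what the question asks for ("5w" for the lowest). A possible variant, sketched under the assumption that ranking by Value is what is wanted:

```r
library(dplyr)

DF <- data.frame(
  Date  = c(rep("2010-01-31", 16), rep("2010-02-28", 14)),
  Value = rep(1:15, 2)
)

ranked <- DF %>%
  group_by(Date) %>%
  mutate(
    n  = row_number(Value),             # 1 = lowest Value within the date
    nn = 1L + max(n, na.rm = TRUE) - n  # 1 = highest Value within the date
  ) %>%
  ungroup() %>%
  filter(n <= 5 | nn <= 5) %>%          # rows with NA ranks are dropped here
  mutate(v1 = if_else(n <= 5, "5w", "5b")) %>%
  select(-n, -nn)
```

This keeps ten rows per date: the five lowest values as "5w" and the five highest as "5b". Ties are broken by row order, so exactly five rows fall on each side.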
Upvotes: 0