filtering data based on rank and conditions

Question

I have some data which looks similar to the following:

# A tibble: 2,717 x 6
# Groups:   date [60]
   symbol date       monthly.returns score totals score_rank
                             
 1 GIS    2010-01-29        0.0128   0.436  119.           2
 2 GIS    2010-02-26        0.00982  0.205  120.           1
 3 GIS    2010-03-31       -0.0169   0.549   51.1          3
 4 GIS    2010-04-30        0.0123   0.860   28.0          4
 5 GIS    2010-05-28        0.000984 0.888   91.6          4
 6 GIS    2010-06-30       -0.00267  0.828   15.5          4
 7 GIS    2010-07-30       -0.0297   0.482   81.7          2
 8 GIS    2010-08-31        0.0573   0.408   57.2          3
 9 GIS    2010-09-30        0.0105   0.887   93.3          4
10 GIS    2010-10-29        0.0357   0.111   96.6          1
# ... with 2,707 more rows

I have a score_rank, what I want to do is whenever the totals column is > 100 filter the data in the following way:

1) When the score_rank = 1, take the top 5% of observations based on the score column

2) When the score_rank = 2 or 3, take a random sample of 5% of the observations

3) When the score_rank = 4, take the bottom 5% of observations based on the score column.

Data:

tickers <- c("GIS", "KR", "MKC", "SJM", "EL", "HRL", "HSY", "K", 
             "KMB", "MDLZ", "MNST", "PEP", "PG", "PM", "SYY", "TAP", "TSN", "WBA", "WMT",
             "MMM", "ABMD", "ACN", "AMD", "AES", "AON", "ANTM", "APA", "CSCO", "CMS", "KO", "GRMN", "GPS",
             "JEC", "SJM", "JPM", "JNPR", "KSU", "KEYS", "KIM", "NBL", "NEM", "NWL", "NFLX", "NEE", "NOC", "TMO", "TXN", "TWTR")

library(tidyquant)
data <- tq_get(tickers,
               get = "stock.prices",              # Collect the stock price data from 2010 - 2015
               from = "2010-01-01",
               to = "2015-01-01") %>%
  group_by(symbol) %>%
  tq_transmute(select = adjusted,                 # Convert the data from daily prices to monthly prices
               mutate_fun = periodReturn,
               period = "monthly",
               type = "arithmetic")

data$score <- runif(nrow(data), min = 0, max = 1)
data$totals <- runif(nrow(data), min = 10, max = 150)

data <- data %>%
  group_by(date) %>%
  mutate(
    score_rank = ntile(score, 4)
  )

Edit: Added code.

akrun · Accepted Answer

Here is one option to filter. Create a list of functions (fs) for each corresponding 'score_rank', use map2 to loop over the list functions and the corresponding 'score_rank' list of vectors, filter the 'data' where the 'totals' is greater than 100, and the 'score_rank' %in% the input from map2 vector, apply the function on 'score' column to filter the sample of rows and bind the subset data with the data filtered where 'totals' is less than or equal to 100

library(purrr)
library(dplyr)
fs <- list(as_mapper(~  . >= quantile(., prob = 0.95)), 
       as_mapper(~ row_number() %in% sample(row_number(), round(0.05 * n()) )),
       as_mapper(~  . <= quantile(., prob = 0.05))
       )


map2_df(list(1, c(2, 3), 4), fs, ~          

    data %>%        
        filter(totals > 100, score_rank %in% .x) %>%
        filter(.y(score))

         )%>%   bind_rows(data %>%
                            filter(totals <= 100))

filtering data based on rank and conditions

Answers (1)

Related Questions