Reputation: 6117
I want to calculate ntiles using only a subset of the rows, using tidyverse. The following base R code does what I want:
Base R:
diamonds$conditional_quartiles_var <- NA
diamonds$conditional_quartiles_var[ diamonds$price >= 1000 ] <- ntile( diamonds$price[ diamonds$price >= 1000 ], n = 4 )
diamonds$conditional_quartiles_var[ diamonds$price < 1000 ] <- "Less than 1000"
diamonds %>% count(conditional_quartiles_var)
Outputs (what I want):
# A tibble: 5 x 2
conditional_quartiles_var n
<chr> <int>
1 1 9861
2 2 9860
3 3 9860
4 4 9860
5 Less than 1000 14499
The above results are what I want, because the ntiles are calculated only from values of price >= 1000.
Tidyverse attempt
My tidyverse implementation fails because the ntiles are calculated from the whole price-vector:
library(tidyverse)
diamonds %>%
mutate( wrong_conditional_quartiles_var = case_when( price >= 1000 ~ ntile(price, n = 4) %>% as.character(),
price < 1000 ~ "Less than 1000")) %>%
count( wrong_conditional_quartiles_var)
Outputs (not what I want):
# A tibble: 4 x 2
wrong_conditional_quartiles_var n
<chr> <int>
1 2 12471
2 3 13485
3 4 13485
4 Less than 1000 14499
Upvotes: 2
Views: 554
Reputation: 887223
We could use replace
library(dplyr)
diamonds %>%
mutate(quart = "Less than 1000",
quart = replace(quart, price >= 1000, ntile(price[price>=1000], 4))) %>%
count(quart)
# A tibble: 5 x 2
# quart n
# <chr> <int>
#1 1 9861
#2 2 9860
#3 3 9860
#4 4 9860
#5 Less than 1000 14499
Upvotes: 2