arshad

Reputation: 437

Why does the quantile function in R give an unequal count of values in each group?

I am trying to group continuous data into tertiles, using the function quantile to compute the break points. This is my code:

dd$wbc_tert = with(dd, 
               cut(wbc, 
                   vTert, 
                   include.lowest = TRUE, 
                   labels = c("Low", "Medium", "High")))

Isn't this supposed to give an equal count of values in each group? I am getting a different count in each group:

> dd %>% filter(wbc_tert == 'High') %>% select('wbc')  %>% nrow() 
[1] 143
> dd %>% filter(wbc_tert == 'Low') %>% select('wbc')  %>% nrow()
[1] 148
> dd %>% filter(wbc_tert == 'Medium') %>% select('wbc')  %>% nrow()
[1] 139

This is the dput output of the values:

c(10.9, 5.4, 9.1, 7.4, 6.6, 5.5, 4.4, 6.7, 7.8, 6.7, 6.6, 8.6, 
8.4, 4.8, 7, 5.2, 7, 6.7, 10.4, 7.5, 8.5, 6.8, 8.5, 9.4, 4.6, 
6.8, 10.2, 6.7, 4.6, 4.9, 6.7, 8.9, 5.9, 5.9, 9.9, 4.1, 8.4, 
9, 7.7, 8.2, 5.7, 8.4, 7.7, 4.6, 6.5, 7.3, 4.9, 3.8, 6.2, 7.9, 
5.3, 8.9, 6, 4.8, 5.9, 5.4, 8.6, 6.1, 9.5, 5.8, 6.2, 5.8, 7.9, 
9.6, 6.6, 9.6, 7, 10.1, 9, 6.9, 9.1, 6.8, 8.4, 9.6, 4.4, 10.5, 
7.9, 5.6, 5.1, 6.6, 6.5, 12.7, 5.3, 7.7, 4.8, 4.7, 6.1, 4.3, 
6.1, 11.6, 5.9, 7.4, 5.7, 4.7, 4.8, 8.5, 5.6, 7.9, 9.1, 7.8, 
5.3, 5, 8.1, 8.3, 4.7, 5.4, 7.6, 7.2, 5.7, 7.9, 7.9, 6.4, 3.8, 
4.7, 6.2, 5, 7.6, 5.8, 5.4, 4.3, 6, 4.7, 6, 6.1, 5.8, 5.6, 4.7, 
5, 11.5, 6.3, 4.4, 6.8, 6.6, 6.8, 6.1, 4.8, 5.4, 5.8, 5.2, 7.1, 
5.4, 9.1, 6.9, 5.4, 8.5, 5.3, 7.3, 6.9, 9, 6.3, 8.4, 7.8, 5.7, 
6.4, 5.3, 9.6, 6.4, 9.9, 8.9, 7.7, 6.2, 7.2, 4.6, 5.4, 4.6, 11.2, 
3.1, 12.3, 5.9, 11.1, 6.2, 6.6, 4.1, 7.4, 9.4, 4.1, 6.7, 6.7, 
6.1, 6.3, 5.6, NA, 3.7, 6.8, 6.7, 6.4, 7.3, 5.7, 6.7, 6.9, 5.7, 
5.3, 4, 5.6, 4.8, 5.5, 6, 6.6, 3.6, 5.6, 8.9, 6.3, 5.8, 8.2, 
8.6, 8.5, 5.7, 8.6, 6, 5.1, 5.7, 8.2, 5.4, 6.9, 6.9, 8.3, 9.5, 
5.4, 10.2, 8.8, 7.2, 4.8, 9.8, 4.6, 6.3, 5.8, 4.9, 12.7, 7.5, 
10.6, 9.3, 5.5, 10.7, 6.2, 9.3, 8.3, 7.8, 8.05, 9.57, 6.62, 6.21, 
5.34, 6.11, 10.37, 4.45, 5.55, 8.05, 8.31, 5.06, 6.05, 4.76, 
9.09, 9.11, 9.04, 6.99, 6.33, 9.47, 6.48, 4.46, 9.44, 6.88, 7.09, 
5.75, 10.89, 6.68, 3.64, 6.55, 8.69, 5.89, 9.05, 6.38, 11.62, 
9.11, 9.22, 7.97, 9.64, 12.76, 8.39, 6.57, 8.1, 7.3, 10.1, 4.7, 
6.4, 7.2, 5.5, 3.7, 5.1, 9.8, 7.6, 7.7, 6, 3.9, 6.8, 5.4, 5.4, 
9.7, 9, 6, 7.3, 6.3, 5.8, 8.3, 7, 4.1, 11.2, 5, 7.6, 6.5, 4.8, 
8, 10.1, 7.1, 7.4, 4.3, 4, 10.12, 4.3, 7.26, 8.84, 8.44, 8.44, 
8.12, 6.5, 8.58, 8.55, 8.82, 4.53, 9.51, 4.93, 4.42, 4.69, 8.69, 
5.77, 3.37, 6.58, 3.72, 3.09, 7.13, 8.11, 7.2, 12.18, 6.52, 7.91, 
5.69, 8.24, 7.67, 5.69, 4.85, 7.03, 4.16, 3.57, 8.1, 4.61, 5.98, 
5.13, 7.68, 5.47, 5.54, 4.59, 6, 11.62, 7.38, 7.06, 8.74, 8.02, 
6.73, 7.19, 6.36, 4.86, 6.55, 8.4, 7.76, 4.73, 4.8, 5.73, 8.53, 
4.6, 7.96, 9.48, 6.59, 5.75, 6.61, 6.49, 7.91, 6.92, 7.14, 6.24, 
12.53, 7.03, 4.73, 8.05, 7.26, 4.07, 6.7, 5.7, 7.39, 5.2, 6.61, 
6.8, 6.77, 5.65, 6.08, 7.24, 6.13, 7.92, 7.37, 7.99, 3.31, 9.72, 
8.71, 8.35, 5.05, 8.15, 5.1, 5.4, 8.8, 4.9, 5, 7.43, 10.3, 6.3, 
9.5, 6.9, 6.7, 5.4, 7.7, 8, 6.5, 5.6, 9.7)

Can someone please help me understand what the reason could be?

Upvotes: 2

Views: 740

Answers (2)

Rui Barradas

Reputation: 76470

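The function quantile can compute quantiles in 9 different ways; the default is type = 7.
Use findInterval instead of cut. The two differ in how they treat values equal to a break point: by default cut uses intervals closed on the right, (a, b], while findInterval uses intervals closed on the left, [a, b). When several values are tied at a break, this changes the group counts.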

vTert <- quantile(dd$wbc, (0:3)/3, na.rm = TRUE)

dd$wbc_tert <- findInterval(dd$wbc, vTert, rightmost.closed = TRUE, all.inside = TRUE)
dd$wbc_tert <- factor(dd$wbc_tert, labels = c("Low", "Medium", "High"))

table(dd$wbc_tert, useNA = "always")
#> 
#>    Low Medium   High   <NA> 
#>    143    143    144      1

Created on 2022-08-18 by the reprex package (v2.0.1)
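
To see why swapping cut for findInterval changes the counts, here is a minimal sketch on made-up data (the vector x below is hypothetical, not the asker's wbc values). The tied values 3 and 5 sit exactly on the tertile breaks, so the side on which each interval is closed decides which group they join.

```r
# Hypothetical toy data: 3 and 5 are repeated and fall exactly on the breaks
x <- c(1, 2, 3, 3, 3, 5, 5, 5, 8, 9)
breaks <- quantile(x, (0:3)/3)  # default type = 7; breaks are 1, 3, 5, 9

# cut(): intervals closed on the right by default: [1,3], (3,5], (5,9]
by_cut <- cut(x, breaks, include.lowest = TRUE)

# findInterval(): intervals closed on the left: [1,3), [3,5), [5,9]
by_find <- findInterval(x, breaks, rightmost.closed = TRUE, all.inside = TRUE)

# cut() with right = FALSE uses the same left-closed intervals as findInterval()
by_cut_left <- cut(x, breaks, right = FALSE, include.lowest = TRUE)

table(by_cut)       # counts 5, 3, 2
table(by_find)      # counts 2, 3, 5
table(by_cut_left)  # counts 2, 3, 5
```

Neither split is equal here. All values tied at a break point go to the same group, so which function looks more balanced depends on where the ties fall in the data.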

Upvotes: 1

Baraliuh

Reputation: 2141

Here is an example showing that groups defined by quantiles do not necessarily contain the same number of values.

# Define some data
x <- 1:10
y <-  rep(1:2, 10)
# Look at the quantiles
quantile(x)
#>    0%   25%   50%   75%  100% 
#>  1.00  3.25  5.50  7.75 10.00
# Due to the added y we now have asymmetry in sizes
quantile(c(y,x))
#>    0%   25%   50%   75%  100% 
#>  1.00  1.00  2.00  2.75 10.00
# Notice how the number of values below 50 % and 75 % changes.
## Without y we get roughly the same bin size
sum(x<quantile(x, .5))
#> [1] 5
sum(x<quantile(x, .75))
#> [1] 7
## But when we add y, the count doubles even though the percentile
## only increases by 25 points
sum(c(y,x)<quantile(c(y,x), .5))
#> [1] 11
sum(c(y,x)<quantile(c(y,x), .75))
#> [1] 22

Created on 2022-08-18 by the reprex package (v2.0.1)
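
If group sizes that are as equal as possible matter more than value-based cut points, one option that is not part of the answer above is to split on ranks instead of values. The sketch below uses base R only; rank_tertile is a hypothetical helper name, and v reproduces c(y, x) from the example above.

```r
# Hypothetical helper (an assumption, not from the answers): assign groups by rank
rank_tertile <- function(v, labels = c("Low", "Medium", "High")) {
  # ties.method = "first" breaks ties by position; NA values keep an NA rank
  r <- rank(v, ties.method = "first", na.last = "keep")
  n <- sum(!is.na(v))
  # Map ranks 1..n onto groups 1..length(labels) of near-equal size
  grp <- ceiling(r * length(labels) / n)
  factor(grp, levels = seq_along(labels), labels = labels)
}

v <- c(rep(1:2, 10), 1:10)  # same 30 values as c(y, x) above
table(rank_tertile(v))      # 10 values in each group
```

The trade-off is that identical values can end up in different groups, because ties are broken by position rather than by value. Whether that is acceptable depends on the analysis.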

Upvotes: 2
