Champ

Reputation: 133

Split a data frame column into quantiles so that no value is split across two quantiles in R

I have a data frame defined as below:

  df <- structure(list(value = c(1, 1, 2, 2, 2, 2, 2, 3, 4, 5)), class = "data.frame", row.names = c(NA,
-10L))

I want to split the column 'value' into n quantile groups (say n = 3) such that no value falls into two groups. For example, every row with value 2 should get the same quantile group.

I tried the 'ntile' function from dplyr as below:

library(dplyr)
df1 <- mutate(df, R_rank = ntile(value, 3))

Result is:

structure(list(value = c(1, 1, 2, 2, 2, 2, 2, 3, 4, 5), R_rank = c(1L, 
1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L)), class = "data.frame", row.names = c(NA, 
-10L))

Here the value 2 falls into two different quantiles (1 and 2), but I want every value to fall into exactly one quantile.
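A quick way to confirm the problem is to count how many distinct groups each value lands in. The snippet below is only a sketch in base R: it rebuilds the ntile result shown above by hand, and the names groups_per_value and split_values are made up for illustration.

```r
# Rebuild the result shown above (R_rank as produced by ntile)
df1 <- data.frame(
  value  = c(1, 1, 2, 2, 2, 2, 2, 3, 4, 5),
  R_rank = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L)
)

# For each distinct value, count how many different groups it was assigned to
groups_per_value <- tapply(df1$R_rank, df1$value, function(r) length(unique(r)))

# Values that were spread over more than one group
split_values <- names(groups_per_value)[groups_per_value > 1]
print(split_values)
```

For this data only the value 2 is reported; an acceptable solution would report nothing here.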

How can I do this in R?

Upvotes: 2

Views: 805

Answers (2)

Rui Barradas

Reputation: 76402

One possible solution is to set the type argument of quantile to a value other than the default, type = 7.

V <- df$value  # the column to split
n <- 3

q5 <- quantile(V, probs = seq(0, 1, length.out = n + 1), type = 5)
q6 <- quantile(V, probs = seq(0, 1, length.out = n + 1), type = 6)
q8 <- quantile(V, probs = seq(0, 1, length.out = n + 1), type = 8)
q9 <- quantile(V, probs = seq(0, 1, length.out = n + 1), type = 9)

And to split the input vector:

split(V, findInterval(V, q5))
split(V, findInterval(V, q6))
split(V, findInterval(V, q8))
split(V, findInterval(V, q9))

The split calls above all give the same result; the reason is explained below.

The types 5, 6, 8 and 9 were found with the following code:

sapply(1:9, function(i)
  quantile(V, probs = seq(0, 1, length.out = n + 1), type = i)
)
#          [,1] [,2] [,3] [,4]     [,5]     [,6] [,7]     [,8]     [,9]
#0%           1    1    1    1 1.000000 1.000000    1 1.000000 1.000000
#33.33333%    2    2    2    2 2.000000 2.000000    2 2.000000 2.000000
#66.66667%    2    2    2    2 2.166667 2.333333    2 2.222222 2.208333
#100%         5    5    5    5 5.000000 5.000000    5 5.000000 5.000000

In columns 5, 6, 8 and 9 the 2/3 quantile differs from 2, so the break points are all distinct and any of those types can be chosen to address the question.
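The same selection could also be made programmatically rather than by reading the table. This is a rough sketch, assuming V holds the value column from the question; probs and distinct_breaks are names introduced here only for illustration.

```r
V <- c(1, 1, 2, 2, 2, 2, 2, 3, 4, 5)
n <- 3
probs <- seq(0, 1, length.out = n + 1)

# TRUE for every quantile type whose break points contain no repeated value
distinct_breaks <- sapply(1:9, function(i) {
  anyDuplicated(unname(quantile(V, probs = probs, type = i))) == 0
})

# Types that could be used for the split
print(which(distinct_breaks))
```

Which types qualify depends on the data, so the check would need to be rerun for a different vector.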

For those types the 2/3 quantile always lies between 2 and 3, which is why the split calls all return the same list.
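To attach the group to the data frame instead of splitting the vector, something like the following might work. It is a sketch under two assumptions: type = 5 is picked arbitrarily from the types above, and rightmost.closed = TRUE is added because findInterval otherwise places the maximum value (here 5) in an extra interval n + 1 of its own. The names breaks and group are made up.

```r
df <- data.frame(value = c(1, 1, 2, 2, 2, 2, 2, 3, 4, 5))
n <- 3

# Break points from one of the quantile types with distinct breaks
breaks <- quantile(df$value, probs = seq(0, 1, length.out = n + 1), type = 5)

# rightmost.closed = TRUE keeps the maximum in the last of the n groups
df$group <- findInterval(df$value, breaks, rightmost.closed = TRUE)

# Rows per group
print(table(df$group))
```

Because the group depends only on the value itself, equal values always share a group. The groups are not guaranteed to be equally sized, and with other data a given type may still produce repeated breaks.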

Upvotes: 2

Ronak Shah

Reputation: 388982

You can probably use cut:

cut(df$value, 3, labels = FALSE)
#[1] 1 1 1 1 1 1 1 2 3 3

where the input is

df$value
#[1] 1 1 2 2 2 2 2 3 4 5

So the values 1 and 2 fall into group 1, 3 falls into group 2, and 4 and 5 fall into group 3.
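Note that when cut is given a single number of breaks, it divides the range of the data into that many equal-width intervals, so the groups are bins of equal width rather than quantiles with similar counts. A small sketch to see the effect on the data from the question (the column name group is made up here):

```r
df <- data.frame(value = c(1, 1, 2, 2, 2, 2, 2, 3, 4, 5))

# Three equal-width bins over the range of value
df$group <- cut(df$value, 3, labels = FALSE)

# Rows per group: the bins are far from equally populated
print(table(df$group))

# Group assigned to each distinct value: one group per value
print(tapply(df$group, df$value, unique))
```

Whether this is acceptable depends on whether equal-width bins or roughly equal-sized groups are wanted.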

Upvotes: 2
