I have a data frame defined as below:
structure(list(value = c(1, 1, 2, 2, 2, 2, 2, 3, 4, 5)), class = "data.frame", row.names = c(NA,
-10L))
I want to split the column 'value' into n quantile groups (say n = 3) such that no value falls into two groups. For example, every occurrence of the value 2 should get the same quantile.
I tried using the 'ntile' function from dplyr as below:
df1 <- mutate(df, R_rank = ntile(value, 3))
Result is:
structure(list(value = c(1, 1, 2, 2, 2, 2, 2, 3, 4, 5), R_rank = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L)), class = "data.frame", row.names = c(NA,
-10L))
Here the value 2 falls into two different quantiles (1 and 2), but I want every value to fall into exactly one quantile.
How can I do this in R?
Upvotes: 2
Views: 805
Maybe the solution is to set the quantile argument type to a value other than the default, type = 7.
V <- df$value  # assumption: V is the column to split
n <- 3
q5 <- quantile(V, probs = seq(0, 1, length.out = n + 1), type = 5)
q6 <- quantile(V, probs = seq(0, 1, length.out = n + 1), type = 6)
q8 <- quantile(V, probs = seq(0, 1, length.out = n + 1), type = 8)
q9 <- quantile(V, probs = seq(0, 1, length.out = n + 1), type = 9)
And to split the input vector:
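# rightmost.closed = TRUE keeps the maximum (5) in the last group;
# without it, findInterval() puts the maximum in an extra group of its own
split(V, findInterval(V, q5, rightmost.closed = TRUE))
split(V, findInterval(V, q6, rightmost.closed = TRUE))
split(V, findInterval(V, q8, rightmost.closed = TRUE))
split(V, findInterval(V, q9, rightmost.closed = TRUE))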
The split instructions above all give the same result. See below.
The values 5, 6, 8 and 9 were found with the following code:
sapply(1:9, function(i)
  quantile(V, probs = seq(0, 1, length.out = n + 1), type = i)
)
#           [,1] [,2] [,3] [,4]     [,5]     [,6] [,7]     [,8]     [,9]
#0%            1    1    1    1 1.000000 1.000000    1 1.000000 1.000000
#33.33333%     2    2    2    2 2.000000 2.000000    2 2.000000 2.000000
#66.66667%     2    2    2    2 2.166667 2.333333    2 2.222222 2.208333
#100%          5    5    5    5 5.000000 5.000000    5 5.000000 5.000000
In columns 5, 6, 8 and 9 the 2/3 quantile differs from 2, so those types can be chosen to address the question.
Those 2/3 quantiles all lie strictly between 2 and 3. Since no data value falls in that range, each of them separates the same values, which is why the split instructions all output the same list.
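As a minimal sketch of how the steps above might be combined to get a group column on the original data frame rather than a split list: the column name R_rank is borrowed from the question, type = 6 is an arbitrary pick among the four types that work for this data, and rightmost.closed = TRUE is an added assumption so that the maximum stays in the last group.

```r
# Data from the question
df <- data.frame(value = c(1, 1, 2, 2, 2, 2, 2, 3, 4, 5))
n <- 3

# Break points from a quantile type that separates 2 from 3 for this data
breaks <- quantile(df$value, probs = seq(0, 1, length.out = n + 1), type = 6)

# findInterval() maps each value to the interval it falls in, so equal values
# always share a group; rightmost.closed = TRUE keeps max(value) in group n
df$R_rank <- findInterval(df$value, breaks, rightmost.closed = TRUE)

df$R_rank
# [1] 1 1 2 2 2 2 2 3 3 3
```

Which types give distinct break points depends on the data, so it may be worth checking anyDuplicated(breaks) before relying on the result.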
Upvotes: 2
You can probably use cut, which divides the range of the values into 3 intervals of equal width (not into quantiles):
cut(df$value, 3, labels = FALSE)
#[1] 1 1 1 1 1 1 1 2 3 3
where
df$value #is
#[1] 1 1 2 2 2 2 2 3 4 5
So 1-2 fall into group 1, 3 falls into group 2, and 4-5 fall into group 3.
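To illustrate where those groups come from, here is a small sketch under the assumption that df is the data frame from the question. The variable names grp and width_breaks are made up for illustration. The break points shown are approximate, because cut nudges the outer limits slightly outward so that the minimum and maximum are included.

```r
# Data from the question
df <- data.frame(value = c(1, 1, 2, 2, 2, 2, 2, 3, 4, 5))

# Integer group codes from 3 equal-width intervals
grp <- cut(df$value, 3, labels = FALSE)

# Approximate break points: the range 1..5 split into 3 equal pieces
width_breaks <- seq(min(df$value), max(df$value), length.out = 4)
width_breaks

# Cross-tabulate values against groups: each distinct value appears
# in exactly one group column
table(value = df$value, group = grp)
```

Because the intervals are equal in width rather than in count, the group sizes here are 7, 1 and 2. Whether that is acceptable depends on whether roughly equal-sized groups are required.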
Upvotes: 2