Reputation: 319
I need to split a sorted vector of unknown length in R into "top 10%, ..., bottom 10%".
So, for example, if I have vector <- order(c(1:98928)), I want to split it into 10 different vectors, each one representing approximately 10% of the total length.
I've tried using split <- split(vector, 1:10), but as I don't know the length of the vector, I get this warning whenever the length is not a multiple of 10:
data length is not a multiple of split variable
And even if the length is a multiple and the function works, split() does not keep the order of my original vector. This is what split gives:
split(c(1:10) , 1:2)
$`1`
[1] 1 3 5 7 9
$`2`
[1] 2 4 6 8 10
And this is what I want:
$`1`
[1] 1 2 3 4 5
$`2`
[1] 6 7 8 9 10
I'm a newbie in R and I've been trying lots of things without success. Does anyone know how to do this?
Upvotes: 9
Views: 7550
Reputation: 11
You can use the sum() function to find the positions at which to cut the vector. Compare the data against the percentile value you are interested in with a comparison operator (such as <= or >) and sum the result: because sum() counts TRUE as 1 and FALSE as 0, this gives the number of elements at or below (or above) that percentile. It is important to sort the elements of the vector first.
# A vector with numbers from 1 to 100
data <- seq(1,100)
# 25th percentile value and 75th percentile value
ps1 <- quantile(data,probs=c(0.25))
ps2 <- quantile(data,probs=c(0.75))
# Positions to split
position1 <- sum(data<=ps1)
position2 <- sum(data<=ps2)
# Split with positions in a sorted data
sort(data)[position1:position2]
The result is
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
In the same way, you can divide a sorted vector into 10 roughly equal parts by specifying the percentiles that separate them:
# A vector with numbers from 1 to 100
data <- seq(1,100)
# Percentile values that mark the upper end of each sub-vector
subvectors <- quantile(data,probs=c(0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,1))
# Number of elements already assigned to an earlier sub-vector
position1 <- 0
for (i in seq_along(subvectors)){
# Number of elements at or below the i-th percentile value
position2 <- sum(data<=subvectors[i])
# Take the next block of the sorted data, so that blocks do not overlap
print(sort(data)[position1+seq_len(position2-position1)])
# The next block starts right after this one
position1 <- position2
}
Upvotes: 0
Reputation: 756
If you have your vector as a column (named vec) in a data frame, you can simply do something like this:
df$new_vec <- cut(df$vec , breaks = quantile(df$vec, c(0, .1,.., 1)),
labels=1:10, include.lowest=TRUE)
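A minimal runnable sketch of the idea above, assuming the elided break points are meant to be the deciles 0, 0.1, ..., 1 (written here as 0:10 / 10). The data frame and its values are made up for illustration. Note that cut() needs unique breaks, so this may fail on data with many tied values.

```r
# Hypothetical data frame with a numeric column `vec` (20 distinct values)
df <- data.frame(vec = c(12, 5, 33, 8, 21, 40, 17, 2, 29, 36,
                         14, 25, 9, 31, 19, 3, 27, 38, 11, 23))

# Assumption: the break points are the deciles of the column
breaks <- quantile(df$vec, probs = 0:10 / 10)

# Label each row with the decile bin (1 = lowest values, 10 = highest)
df$new_vec <- cut(df$vec, breaks = breaks, labels = 1:10, include.lowest = TRUE)

# Optionally turn the labels into a list of 10 sub-vectors
chunks <- split(df$vec, df$new_vec)
```

Because the bins are defined by value, the rows do not need to be sorted beforehand.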
Upvotes: 5
Reputation: 73315
Break a sorted vector x every 10% into 10 chunks.
Note that there are two interpretations of this:
Cutting by vector index:
split(x, floor(10 * seq.int(0, length(x) - 1) / length(x)))
Cutting by vector values (say, quantiles):
split(x, cut(x, quantile(x, probs = 0:10 / 10, names = FALSE), include.lowest = TRUE))
In the following, I demonstrate both approaches using this data:
set.seed(0); x <- sort(round(rnorm(23),1))
In particular, the example data are normally distributed rather than uniformly distributed, so cutting by index and cutting by value give substantially different results.
cutting by index
#$`0`
#[1] -1.5 -1.2 -1.1
#
#$`1`
#[1] -0.9 -0.9
#
#$`2`
#[1] -0.8 -0.4
#
#$`3`
#[1] -0.3 -0.3 -0.3
#
#$`4`
#[1] -0.3 -0.2
#
#$`5`
#[1] 0.0 0.1
#
#$`6`
#[1] 0.3 0.4 0.4
#
#$`7`
#[1] 0.4 0.8
#
#$`8`
#[1] 1.3 1.3
#
#$`9`
#[1] 1.3 2.4
cutting by quantile
#$`[-1.5,-1.06]`
#[1] -1.5 -1.2 -1.1
#
#$`(-1.06,-0.86]`
#[1] -0.9 -0.9
#
#$`(-0.86,-0.34]`
#[1] -0.8 -0.4
#
#$`(-0.34,-0.3]`
#[1] -0.3 -0.3 -0.3 -0.3
#
#$`(-0.3,-0.2]`
#[1] -0.2
#
#$`(-0.2,0.14]`
#[1] 0.0 0.1
#
#$`(0.14,0.4]`
#[1] 0.3 0.4 0.4 0.4
#
#$`(0.4,0.64]`
#numeric(0)
#
#$`(0.64,1.3]`
#[1] 0.8 1.3 1.3 1.3
#
#$`(1.3,2.4]`
#[1] 2.4
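To make the difference between the two outputs above easier to see, here is a small sketch that compares only the chunk sizes. It reuses the two one-liners from this answer on the same demonstration data; the variable names are introduced here for illustration.

```r
set.seed(0); x <- sort(round(rnorm(23), 1))

# Cutting by index: every chunk gets 2 or 3 elements
by_index <- split(x, floor(10 * seq.int(0, length(x) - 1) / length(x)))

# Cutting by value: chunk sizes depend on ties and on how the values spread
by_value <- split(x, cut(x, quantile(x, probs = 0:10 / 10, names = FALSE),
                         include.lowest = TRUE))

index_sizes <- lengths(by_index)
value_sizes <- lengths(by_value)
print(index_sizes)
print(value_sizes)
```

Both lists hold all 23 values, but only the index-based split guarantees chunks of near-equal size; the value-based split can even produce an empty chunk, as in the output above.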
Upvotes: 8
Reputation: 6727
x <- 1:98
y <- split(x, ((seq(length(x))-1)*10)%/%length(x)+1)
Explanation:
seq(length(x)) = 1..98
seq(length(x))-1 = 0..97
(seq(length(x))-1)*10 = (0, 10, ..., 970)
# each group number covers about 10% of the values, 98 in total
((seq(length(x))-1)*10)%/%length(x) = (0, ..., 0, 1, ..., 1, ..., 9, ..., 9)
# each group number covers about 10% of the values, 98 in total
((seq(length(x))-1)*10)%/%length(x)+1 = (1, ..., 1, 2, ..., 2, ..., 10, ..., 10)
# splits first ~10% of numbers to 1, next ~10% of numbers to 2 etc.
split(x, ((seq(length(x))-1)*10)%/%length(x)+1)
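The steps above can be wrapped in a small function and applied to the length from the question. This is only a sketch: split_into_chunks is a hypothetical helper name, and it assumes x is non-empty and already sorted.

```r
# Hypothetical helper: split x into n contiguous chunks of near-equal size
split_into_chunks <- function(x, n = 10) {
  idx <- seq_along(x) - 1                # 0-based positions
  group <- (idx * n) %/% length(x) + 1   # group numbers 1..n
  split(x, group)
}

x <- 1:98928
chunks <- split_into_chunks(x)

# Chunk sizes differ by at most one element
sizes <- lengths(chunks)
print(sizes)
```

Since the grouping vector is non-decreasing, concatenating the chunks gives back x in its original order.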
Upvotes: 4
Reputation: 214957
If the vector is sorted, you can just create a grouping variable with the same length as the vector and split on it. In a real case this takes a little more effort, since the length of the vector may not be a multiple of 10, but for your toy example you can do:
x <- 1:10
n = 2
split(x, rep(1:n, each = length(x)/n))
# $`1`
# [1] 1 2 3 4 5
# $`2`
# [1] 6 7 8 9 10
A real case example, where the vector's length is not a multiple of the number of groups:
vec = 1:13
n = 3
split(vec, sort(seq_along(vec)%%n))
# $`0`
# [1] 1 2 3 4
# $`1`
# [1] 5 6 7 8 9
# $`2`
# [1] 10 11 12 13
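The same modulo trick can be sketched for the 10 groups asked about in the question. The vector 1:98 is chosen here only as an example whose length is not a multiple of 10.

```r
vec <- 1:98
n <- 10

# seq_along(vec) %% n cycles through the remainders 1, 2, ..., 9, 0, ...;
# sorting them turns the cycle into contiguous runs of equal group labels
groups <- sort(seq_along(vec) %% n)
chunks <- split(vec, groups)

sizes <- lengths(chunks)
print(sizes)
```

The groups are labelled 0 to 9 rather than 1 to 10, and their sizes differ by at most one element.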
Upvotes: 2