Reputation: 329

split a vector by percentile

I need to split a sorted vector of unknown length in R into "top 10%, ..., bottom 10%". For example, if I have vector <- order(c(1:98928)), I want to split it into 10 different vectors, each one holding approximately 10% of the total length.

I've tried using split <- split(vector, 1:10), but as I don't know the length of the vector, I get this warning whenever the length is not a multiple of 10:

data length is not a multiple of split variable

And even when the length is a multiple and the function works, split() does not keep the order of my original vector. This is what split() gives:

split(c(1:10) , 1:2)
$`1`
[1] 1 3 5 7 9

$`2`
[1]  2  4  6  8 10

And this is what I want:

$`1`
[1] 1 2 3 4 5

$`2`
[1]  6  7  8  9 10

I'm a newbie in R and I've been trying lots of things without success. Does anyone know how to do this?

Upvotes: 10

Answers (5)

Daniel Pinto S

Reputation: 11

You can use sum() to find the positions at which to cut the vector. Comparing the vector with a percentile value using a logical operator such as <= gives a logical vector, and sum() counts each TRUE as 1 and each FALSE as 0, so the sum is the number of elements at or below that value. The elements of the vector must be sorted first.

# A vector with numbers from 1 to 100
data <- seq(1,100)

# 25th percentile value and 75th percentile value
ps1 <- quantile(data,probs=c(0.25))
ps2 <- quantile(data,probs=c(0.75))

# Positions to split
position1 <- sum(data<=ps1)
position2 <- sum(data<=ps2)

# Split with positions in a sorted data
sort(data)[position1:position2]

The result is

25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75

In the same way, you can divide a sorted vector into 10 roughly equal parts by specifying the percentiles:

# A vector with numbers from 1 to 100
data <- seq(1,100)

# sub vectors based on percentiles
subvectors <- quantile(data, probs = seq(0.1, 1, by = 0.1))

# Start before the first element
position1 <- 0

for (i in seq_along(subvectors)) {

  # Last position at or below the i-th percentile value
  position2 <- sum(data <= subvectors[i])

  # Elements after the previous cut, up to and including this one
  print(sort(data)[seq(position1 + 1, length.out = position2 - position1)])

  # The next chunk starts after this cut, so chunks do not overlap
  position1 <- position2
}
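If the pieces are needed for later use rather than printed, the same idea can be wrapped in a function that returns a list. This is only a sketch: the helper name split_by_percentile and its probs argument are made up for illustration, and it assumes a numeric vector without missing values.

```r
# Hypothetical helper (not from the answer): cut sorted data at percentile positions
split_by_percentile <- function(data, probs = seq(0.1, 1, by = 0.1)) {
  data <- sort(data)
  cuts <- quantile(data, probs = probs, names = FALSE)

  # Last position at or below each percentile value
  ends <- vapply(cuts, function(p) sum(data <= p), numeric(1))

  # Each chunk starts right after the previous cut
  starts <- c(0, head(ends, -1)) + 1

  # seq() with length.out also copes with empty chunks
  Map(function(s, e) data[seq(s, length.out = e - s + 1)], starts, ends)
}

chunks <- split_by_percentile(1:100)
lengths(chunks)
```

Because every chunk starts one position after the previous cut, no element appears in two chunks.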

Upvotes: 0

Slouei

Reputation: 756

If you have your vector as a column (named vec) in a data frame, you can simply do something like this:

df$new_vec <- cut(df$vec , breaks = quantile(df$vec, c(0, .1,.., 1)), 
                labels=1:10, include.lowest=TRUE)
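A runnable sketch of this approach, under some assumptions: the data frame contents are made up, and the elided probability sequence is read as 0, 0.1, ..., 1 (eleven break points for the ten labels). Note that cut() stops with an error when the breaks are not unique, which can happen with heavily tied data.

```r
# Illustrative data; `df` and `vec` follow the answer's names, the values are invented
df <- data.frame(vec = 1:98)

# Assumed reading of the elided sequence: break points at 0%, 10%, ..., 100%
breaks <- quantile(df$vec, probs = seq(0, 1, by = 0.1))

df$new_vec <- cut(df$vec, breaks = breaks, labels = 1:10, include.lowest = TRUE)

# Split the column by its decile label; order within each piece is preserved
pieces <- split(df$vec, df$new_vec)
lengths(pieces)
```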

Upvotes: 5

Zheyuan Li

Reputation: 73415

Problem statement

Break a sorted vector x every 10% into 10 chunks.

Note that there are two interpretations of this:

Cutting by vector index:

split(x, floor(10 * seq.int(0, length(x) - 1) / length(x)))

Cutting by vector values (say, quantiles):

split(x, cut(x, quantile(x, probs = 0:10 / 10, names = FALSE), include.lowest = TRUE))

In the following, I demonstrate both using this data:

set.seed(0); x <- sort(round(rnorm(23),1))

Notably, the example data are normally distributed rather than uniformly distributed, so cutting by index and cutting by value give substantially different results.

Result

cutting by index

#$`0`
#[1] -1.5 -1.2 -1.1
#
#$`1`
#[1] -0.9 -0.9
#
#$`2`
#[1] -0.8 -0.4
#
#$`3`
#[1] -0.3 -0.3 -0.3
#
#$`4`
#[1] -0.3 -0.2
#
#$`5`
#[1] 0.0 0.1
#
#$`6`
#[1] 0.3 0.4 0.4
#
#$`7`
#[1] 0.4 0.8
#
#$`8`
#[1] 1.3 1.3
#
#$`9`
#[1] 1.3 2.4

cutting by quantile

#$`[-1.5,-1.06]`
#[1] -1.5 -1.2 -1.1
#
#$`(-1.06,-0.86]`
#[1] -0.9 -0.9
#
#$`(-0.86,-0.34]`
#[1] -0.8 -0.4
#
#$`(-0.34,-0.3]`
#[1] -0.3 -0.3 -0.3 -0.3
#
#$`(-0.3,-0.2]`
#[1] -0.2
#
#$`(-0.2,0.14]`
#[1] 0.0 0.1
#
#$`(0.14,0.4]`
#[1] 0.3 0.4 0.4 0.4
#
#$`(0.4,0.64]`
#numeric(0)
#
#$`(0.64,1.3]`
#[1] 0.8 1.3 1.3 1.3
#
#$`(1.3,2.4]`
#[1] 2.4
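To compare the two results at a glance, one option is to look only at the chunk sizes. The sketch below reuses the example data; the names by_index and by_value are introduced here for illustration.

```r
set.seed(0); x <- sort(round(rnorm(23), 1))

# Cutting by vector index: chunk sizes differ by at most one
by_index <- split(x, floor(10 * seq.int(0, length(x) - 1) / length(x)))

# Cutting by vector values: chunk sizes follow the data, and a chunk can be empty
by_value <- split(x, cut(x, quantile(x, probs = 0:10 / 10, names = FALSE),
                         include.lowest = TRUE))

lengths(by_index)
lengths(by_value)
```

Both splits account for all 23 values; only the way they are distributed over the 10 chunks differs.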

Upvotes: 8

user31264

Reputation: 6737

x <- 1:98
y <- split(x, ((seq(length(x))-1)*10)%/%length(x)+1)

Explanation:

seq(length(x)) = 1..98

seq(length(x))-1 = 0..97

(seq(length(x))-1)*10 = (0, 10, ..., 970)

# each value covers about 10% of the elements, 98 in total
((seq(length(x))-1)*10)%/%length(x) = (0, ..., 0, 1, ..., 1, ..., 9, ..., 9)

# the same groups, renumbered from 1 to 10
((seq(length(x))-1)*10)%/%length(x)+1 = (1, ..., 1, 2, ..., 2, ..., 10, ..., 10)

# splits first ~10% of numbers to 1, next ~10% of numbers to 2 etc.
split(x, ((seq(length(x))-1)*10)%/%length(x)+1)
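The steps above could be wrapped in a small function so the number of chunks becomes a parameter. This is a sketch only: split_chunks and its argument n are hypothetical names, not part of the answer.

```r
# Hypothetical helper: split x into n consecutive chunks of near-equal size
split_chunks <- function(x, n = 10) {
  # Group number 1..n for each position, computed by integer division
  group <- ((seq_along(x) - 1) * n) %/% length(x) + 1
  split(x, group)
}

y <- split_chunks(1:98)
lengths(y)  # chunk sizes, summing to 98

# Order is preserved: gluing the chunks back together gives the original vector
identical(unlist(y, use.names = FALSE), 1:98)
```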

Upvotes: 4

akuiper

Reputation: 215137

If the vector is sorted, you can just create a grouping variable with the same length as the vector and split on it. In a real case this takes a little more effort, since the length of the vector may not be a multiple of the number of groups, but for your toy example you can do:

x = 1:10
n = 2
split(x, rep(1:n, each = length(x)/n))
# $`1`
# [1] 1 2 3 4 5

# $`2`
# [1]  6  7  8  9 10

A more realistic example, where the vector's length is not a multiple of the number of groups:

vec = 1:13
n = 3
split(vec, sort(seq_along(vec)%%n))
# $`0`
# [1] 1 2 3 4

# $`1`
# [1] 5 6 7 8 9

# $`2`
# [1] 10 11 12 13
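Applied to the length from the question, the same grouping might look like the sketch below. The helper name split_sorted is invented for illustration; it assumes the vector is already sorted.

```r
# Hypothetical helper wrapping the modulo-and-sort grouping for any length
split_sorted = function(vec, n) split(vec, sort(seq_along(vec) %% n))

vec = 1:98928
pieces = split_sorted(vec, 10)

# Ten pieces whose sizes differ by at most one
lengths(pieces)

# The pieces, taken in order, reproduce the original vector
identical(unlist(pieces, use.names = FALSE), vec)
```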

Upvotes: 2
