Reputation: 99
Suppose I have a vector of length 915 named base:
[1] 1.467352 4.651796 4.949438 5.625817 5.691591 5.839439 5.927564 7.152487 8.195661 8.640770....591.3779 591.9426 592.0126 592.3861 593.2927 593.3991 593.6104 594.1526 594.5325 594.7093
Also I have constructed another vector:
intervals <- c(0,seq(from = 1, by = 6,length.out = 100))
We can interpret this vector as the breakpoints of a sequence of intervals.
I want to find out which of these intervals each value of base lies in. For example, the first element of base lies in the second interval: 1.467352 does not lie in (0,1], but it does lie in (1,7]. I want to do the same for every value in base.
From this I want to create a third vector whose i-th element is the number of the interval that contains the i-th element of base.
However, each interval has a maximum capacity, for example 5 (an interval can hold at most five elements). This means that even if seven elements of base lie in the second interval, the second interval may hold only five of them.
third_vector = 2,2,2,2,2,3,3....
As we can see, only five elements are placed in the second interval. The 6th and 7th elements must go into the third interval because the second one is full.
My question is: how can I implement this efficiently in R?
Upvotes: 1
Views: 659
Reputation: 93811
One option is to bin the data into quantiles, where the number of quantiles is set based on the maximum number of values allowed in a given interval. Below is an example. Let me know if this is what you had in mind:
# Fake data
set.seed(1)
dat = data.frame(x=rnorm(83, 10, 5))
# Cut into intervals containing no more than n values
n = 5
dat$x.bin = cut(dat$x,
                quantile(dat$x, seq(0, 1, length.out=ceiling(nrow(dat)/n)+1)),
                include.lowest=TRUE)
# Check
table(dat$x.bin)
[-1.07,3.62]  (3.62,5.87]   (5.87,6.7]   (6.7,7.29]   (7.29,8.2]   (8.2,9.32]  (9.32,9.72]
           5            5            5            5            5            4            5
 (9.72,9.97]  (9.97,10.8]  (10.8,11.7]  (11.7,12.1]  (12.1,12.9]  (12.9,13.5]    (13.5,14]
           5            5            5            5            4            5            5
   (14,15.5]  (15.5,17.4]    (17.4,22]
           5            5            5
To implement @LorenzoBusetto's suggestion, you could do the following. This method sorts the data and ensures that every interval, except possibly the last, contains exactly n values:
dat = dat[order(dat$x),]
dat$x.bin = 0:(nrow(dat)-1) %/% n
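Neither snippet above keeps the fixed breakpoints from the question. If those matter, the overflow rule described there could be sketched roughly as follows. This is only an illustration: assign_capped is a hypothetical helper, not part of the original answer. It assumes that overflow always moves to the next interval up, that values at or below the first breakpoint get index 0, and that overflow past the last breakpoint is acceptable.

```r
# Hypothetical helper: interval index per value, with at most `cap`
# values per interval; surplus values spill into the next interval.
assign_capped <- function(x, breaks, cap) {
  ord <- order(x)                       # process values in ascending order
  # Natural interval i satisfies breaks[i] < x <= breaks[i + 1]
  natural <- findInterval(x[ord], breaks, left.open = TRUE)
  out <- integer(length(x))
  cur <- 0L                             # interval currently being filled
  filled <- 0L                          # values already placed in `cur`
  for (i in seq_along(natural)) {
    if (natural[i] > cur) {
      cur <- natural[i]                 # jump ahead to the natural interval
      filled <- 0L
    } else if (filled >= cap) {
      cur <- cur + 1L                   # current interval is full: overflow
      filled <- 0L
    }
    filled <- filled + 1L
    out[ord[i]] <- cur                  # store result in the original order
  }
  out
}

# First ten values of base as shown in the question
base_head <- c(1.467352, 4.651796, 4.949438, 5.625817, 5.691591,
               5.839439, 5.927564, 7.152487, 8.195661, 8.640770)
intervals <- c(0, seq(from = 1, by = 6, length.out = 100))
third_vector <- assign_capped(base_head, intervals, cap = 5L)
print(third_vector)
# 2 2 2 2 2 3 3 3 3 3
```

The loop is a single pass over the sorted values, so for a vector of length 915 it should be fast enough. Seven values fall naturally into (1,7], so the sixth and seventh are pushed into the third interval, as in the question's example.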
Upvotes: 2