darckeen

Reputation: 960

cut that returns guaranteed number of bins

I'd like to do a cut with a guaranteed number of levels returned: I want to take any vector of cumulative percentages and cut it into deciles. I've tried using cut and it works well in most situations, but in cases where some deciles contain large percentages it fails to return the desired number of unique cuts, which is 10. Any ideas on how to guarantee that the number of cuts is 10?

In the included example there is no occurrence of decile 7.

> (x <- c(0.04,0.1,0.22,0.24,0.26,0.3,0.35,0.52,0.62,0.66,0.68,0.69,0.76,0.82,1.41,6.19,9.05,18.34,19.85,20.5,20.96,31.85,34.33,36.05,36.32,43.56,44.19,53.33,58.03,72.46,73.4,77.71,78.81,79.88,84.31,90.07,92.69,99.14,99.95))
 [1]  0.04  0.10  0.22  0.24  0.26  0.30  0.35  0.52  0.62  0.66  0.68  0.69  0.76  0.82  1.41  6.19  9.05 18.34 19.85 20.50 20.96 31.85 34.33
[24] 36.05 36.32 43.56 44.19 53.33 58.03 72.46 73.40 77.71 78.81 79.88 84.31 90.07 92.69 99.14 99.95
> (cut(x,seq(0,max(x),max(x)/10),labels=FALSE))
 [1]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  3  3  4  4  4  4  5  5  6  6  8  8  8  8  8  9 10 10 10 10
> (as.integer(cut2(x,seq(0,max(x),max(x)/10))))
 [1]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  3  3  4  4  4  4  5  5  6  6  8  8  8  8  8  9 10 10 10 10
> (findInterval(x,seq(0,max(x),max(x)/10),rightmost.closed=TRUE,all.inside=TRUE))
 [1]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  3  3  4  4  4  4  5  5  6  6  8  8  8  8  8  9 10 10 10 10

I would like to get 10 approximately equally sized intervals, sized in such a way that I am assured of getting 10. cut et al. give 9 bins with this example; I want 10. So I'm looking for an algorithm that would recognize that the gap between 58.03 and the next values, 72.46 and 73.4, is large. Instead of assigning these three cases to bins 6, 8, 8, it would assign them to bins 6, 7, 8.

Upvotes: 2

Views: 1692

Answers (4)

Reynaldo Morillo

Reputation: 21

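numBins = 10
# include.lowest = TRUE keeps min(x) in the first bin instead of returning NA
cut(x, breaks = seq(from = min(x), to = max(x), length.out = numBins+1), include.lowest = TRUE)

Output:

...
...
...
10 Levels: [0.04,10] (10,20] (20,30] (30,40] (40,50] (50,60] ... (90,100]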

This will make 10 bins of equal width. By changing the numBins variable, you can obtain any other number of equal-width bins.
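Equal-width breaks do not by themselves guarantee that every bin receives data. A quick check, sketched here with the question's vector (the names `breaks`, `bins` and `counts` are illustrative only), is to tabulate the integer bin codes and look for zeros:

```r
x <- c(0.04, 0.1, 0.22, 0.24, 0.26, 0.3, 0.35, 0.52, 0.62, 0.66, 0.68, 0.69,
       0.76, 0.82, 1.41, 6.19, 9.05, 18.34, 19.85, 20.5, 20.96, 31.85, 34.33,
       36.05, 36.32, 43.56, 44.19, 53.33, 58.03, 72.46, 73.4, 77.71, 78.81,
       79.88, 84.31, 90.07, 92.69, 99.14, 99.95)

numBins <- 10
breaks <- seq(from = min(x), to = max(x), length.out = numBins + 1)

# Integer bin codes 1..numBins; include.lowest keeps min(x) in bin 1
bins <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Count observations per bin, including bins that received nothing
counts <- tabulate(bins, nbins = numBins)
which(counts == 0)  # indices of empty bins; for this x, bin 7 is empty
```

So the factor has 10 levels, but only 9 of them are actually used for this data.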

Upvotes: 2

Carl Witthoft

Reputation: 21532

What is the problem you are trying to solve? If you don't want quantiles, then your cutpoints are pretty much arbitrary, so you could just as easily create ten bins by sampling without replacement from your original dataset. I realize that's an absurd method, but I want to make a point: you may be way off track but we can't tell because you haven't explained what you intend to do with your bins. Why, for example, is it so bad that one bin has no content?

Upvotes: -1

IRTFM

Reputation: 263481

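# probs (0:10)/10 give 11 breaks, i.e. 10 intervals; include.lowest keeps min(x)
xx <- cut(x, breaks=quantile(x, (0:10)/10, na.rm=TRUE), include.lowest=TRUE)
table(xx)
#------------------------
xx
[0.04,0.256] (0.256,0.58] (0.58,0.718] (0.718,6.76] 
           4            4            4            4 
 (6.76,20.5]  (20.5,35.7]  (35.7,49.7]  (49.7,75.1] 
           4            3            4            4 
 (75.1,85.5]   (85.5,100] 
           4            4 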
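Quantile breaks can coincide when the data contain many tied values, in which case cut stops with an error about non-unique breaks. If the only requirement is that all 10 bins are used, one possible alternative is to bin by rank instead. The sketch below is not the quantile approach above; `rank_bins` is a hypothetical helper name, not a function from base R or any package, and it assumes there are at least as many observations as bins and no missing values:

```r
# Hypothetical helper: assign each value to one of n bins by its rank,
# so every bin is non-empty and bin sizes differ by at most one.
rank_bins <- function(x, n = 10) {
  stopifnot(length(x) >= n, !anyNA(x))
  # Ranks 1..length(x); ties are broken by position, so ranks are unique
  r <- rank(x, ties.method = "first")
  as.integer(ceiling(r * n / length(x)))
}

x <- c(0.04, 0.1, 0.22, 0.24, 0.26, 0.3, 0.35, 0.52, 0.62, 0.66, 0.68, 0.69,
       0.76, 0.82, 1.41, 6.19, 9.05, 18.34, 19.85, 20.5, 20.96, 31.85, 34.33,
       36.05, 36.32, 43.56, 44.19, 53.33, 58.03, 72.46, 73.4, 77.71, 78.81,
       79.88, 84.31, 90.07, 92.69, 99.14, 99.95)

bins <- rank_bins(x, 10)
table(bins)  # every code from 1 to 10 appears
```

The trade-off is that bin membership depends only on order, not on the size of the gaps between values, and tied values can end up in different bins.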

Upvotes: 4

Jason Morgan

Reputation: 2330

Not sure I understand what you need, but if you drop the labels=FALSE and use table to make a frequency table of your data, you will get the number of categories desired:

> table(cut(x, breaks=seq(0, 100, 10)))

  (0,10]  (10,20]  (20,30]  (30,40]  (40,50]  (50,60]  (60,70]  (70,80]  (80,90] (90,100] 
      17        2        2        4        2        2        0        5        1        4 

Notice that there is no data in the 7th category, (60,70].

Upvotes: 1
