BMM

Reputation: 63

How to categorize a large integer vector efficiently in R

I have a large integer vector (a sample is shown below):

a <- c(0,0,0,1,1,2,2,2,4,4,7,7,7,35,50,50, 200,200,500,500,500, 2500,2501,2502,2502)

I would like to create another vector, b, that assigns each value of a to a bin. The category should be 1 for values 0 - 6, 2 for 7 - 13, 3 for 14 - 20, and so on.

I know I can use dplyr's case_when() inside mutate(), but that may not be efficient when the data is large.

Upvotes: 0

Views: 74

Answers (1)

IRTFM

Reputation: 263451

The best way to categorize numeric data into ranges with a numeric output value is the findInterval function. Examples:

> a <- c(0,0,0,1,1,2,2,2,4,4,7,7,7,35,50,50, 200,200,500,500,500, 2500,2501,2502,2502)
> findInterval( a, c(0, 6, 12, 18, 24))
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 5 5 5 5 5 5 5 5 5 5 5 5
> findInterval( a, 6^(0:6))
 [1] 0 0 0 1 1 1 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 5 5 5 5
> 6^(0:6)
[1]     1     6    36   216  1296  7776 46656

Note that findInterval returns 0 for items below the smallest value in its second argument, and returns the length of that argument (vec, i.e., the breaks vector) for items at or above its largest value. The intervals are left-closed and right-open, which is the opposite of how the cut function behaves by default (unless changed with its right argument).
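The breaks in the examples above are illustrative and do not reproduce the bins asked for in the question (0 - 6, 7 - 13, 14 - 20, ...). As a minimal sketch, assuming those bins are meant to be 7 wide, start at 0, and that a holds only non-negative values, the breaks can be generated with seq(). The names breaks, b, b_arith and b_cut are placeholders chosen for this illustration.

```r
# Sample data from the question.
a <- c(0,0,0,1,1,2,2,2,4,4,7,7,7,35,50,50, 200,200,500,500,500, 2500,2501,2502,2502)

# Assumed bin layout: width 7, starting at 0, so breaks are 0, 7, 14, ...
breaks <- seq(0, max(a), by = 7)

# For each element, findInterval() gives the index of the last break <= it,
# so 0-6 map to 1, 7-13 map to 2, 14-20 map to 3, and so on.
b <- findInterval(a, breaks)

# For non-negative values with equal-width bins, integer division
# produces the same categories without building a breaks vector.
b_arith <- a %/% 7 + 1

# cut() with right = FALSE uses the same left-closed intervals; an Inf break
# is appended so values at or above the last break still fall in a bin.
b_cut <- as.integer(cut(a, breaks = c(breaks, Inf), right = FALSE))

head(b, 14)
#  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 6
```

All three vectors agree on this sample. The integer-division form only applies when every bin has the same width; findInterval() also handles irregular breaks such as 6^(0:6).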

Upvotes: 3
