Reputation: 2428
In the code below, the cut function is used with manually specified break points. Since this is sample code, only a few are hard-coded, but in my real scenario we have more than 10 million records, so identifying the ranges for the amount variable is quite difficult.
So my question is :
scipen=999
options(scipen=999)  # suppress scientific notation when printing numbers
id = 1:30
amount = c(30185, 33894, 33642, 29439, 27879, 52347, 4101, 5425,
           6541, 54589, 5214, 1000, 45000, 64125, 100021, 120000,
           657412, 15224, 4578, 3639, 10000, 48781, 64484, 5020,
           15001, 105050, 14521, 59822, 42871, 32542)
df = data.frame(id, amount)
# Break points are hard-coded for this sample
df$group = cut(df$amount, c(10000, 20000, 30000, 40000, 50000, 60000, 70000))
Output for df
Upvotes: 1
Views: 72
Reputation: 311
You can let the cut function choose the cut points by passing a single integer n as breaks instead of specifying the cut points manually. The function then divides the range of the data into n intervals of equal length.
To adjust the number of digits used in the interval labels, set the optional argument dig.lab to the maximum number of digits in your labels.
In your example, you could use the following:
df$group = cut(df$amount,breaks=7, dig.lab=6)
Result:
> df
id amount group
1 1 30185 (343.588,94773.1]
2 2 33894 (343.588,94773.1]
3 3 33642 (343.588,94773.1]
4 4 29439 (343.588,94773.1]
5 5 27879 (343.588,94773.1]
6 6 52347 (343.588,94773.1]
7 7 4101 (343.588,94773.1]
8 8 5425 (343.588,94773.1]
9 9 6541 (343.588,94773.1]
10 10 54589 (343.588,94773.1]
11 11 5214 (343.588,94773.1]
...
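To see how the observations are spread across the automatically chosen intervals, one option is to tabulate the resulting factor. The sketch below reuses the sample amounts from the question as a plain vector (the names group and counts are only illustrative). Because the sample is skewed by a few large amounts, most values land in the first interval.

```r
# Sample amounts copied from the question
amount <- c(30185, 33894, 33642, 29439, 27879, 52347, 4101, 5425,
            6541, 54589, 5214, 1000, 45000, 64125, 100021, 120000,
            657412, 15224, 4578, 3639, 10000, 48781, 64484, 5020,
            15001, 105050, 14521, 59822, 42871, 32542)

# Let cut() choose 7 equal-width intervals over the range of the data
group <- cut(amount, breaks = 7, dig.lab = 6)

# Count how many observations fall into each interval
counts <- table(group)
print(counts)
```

If most rows end up in one or two intervals, as here, equal-width intervals may be too coarse for the bulk of the data, and explicit break points may be a better fit.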
Edit: To get more regular labels, set the cut points using the seq function. For example:
> df$group = cut(df$amount,breaks=seq(0,700000,25000), dig.lab=6)
> head(df)
id amount group
1 1 30185 (25000,50000]
2 2 33894 (25000,50000]
3 3 33642 (25000,50000]
4 4 29439 (25000,50000]
5 5 27879 (25000,50000]
6 6 52347 (50000,75000]
This creates cut points spaced 25000 apart. Note that you need to specify the minimum and maximum of the range yourself (here I used 0 and 700000).
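If hard-coding 0 and 700000 is not practical, the limits could be derived from the data instead. This is a minimal sketch, assuming a fixed step of 25000 and using the hypothetical names step, lower and upper (none of which appear in the original answer): round the minimum down and the maximum up to the nearest multiple of the step, then pass the resulting sequence to cut. The include.lowest = TRUE argument is an addition here, so that a value exactly equal to the lowest break is not turned into NA.

```r
# Sample amounts copied from the question
amount <- c(30185, 33894, 33642, 29439, 27879, 52347, 4101, 5425,
            6541, 54589, 5214, 1000, 45000, 64125, 100021, 120000,
            657412, 15224, 4578, 3639, 10000, 48781, 64484, 5020,
            15001, 105050, 14521, 59822, 42871, 32542)

step <- 25000  # assumed interval width; adjust to your data

# Round the observed range outwards to multiples of the step
lower <- floor(min(amount) / step) * step
upper <- ceiling(max(amount) / step) * step

# Regularly spaced break points covering every observation
breaks <- seq(lower, upper, by = step)

group <- cut(amount, breaks = breaks, dig.lab = 6, include.lowest = TRUE)
print(head(group))
```

On 10 million records, min and max are each a single pass over the vector, so computing the limits this way should add little cost compared with cut itself.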
Upvotes: 1
Reputation: 1550
From the documentation of cut(x, breaks): breaks is either a numeric vector of two or more unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.
You can set dig.lab to keep the interval labels from being displayed in exponential notation.
df$group = cut(df$amount,c(10000, 20000, 30000, 40000, 50000, 60000, 70000), dig.lab = 10)
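The effect of dig.lab can be checked by comparing the labels with and without it. The sketch below reuses the break points from the question (the names default_labels, full_labels and n_missing are only illustrative). With the default dig.lab = 3, the five-digit breaks are printed in exponential form, while a larger value prints them in full. It also counts the NA values, since amounts outside the outermost break points are not assigned to any interval.

```r
# Sample amounts copied from the question
amount <- c(30185, 33894, 33642, 29439, 27879, 52347, 4101, 5425,
            6541, 54589, 5214, 1000, 45000, 64125, 100021, 120000,
            657412, 15224, 4578, 3639, 10000, 48781, 64484, 5020,
            15001, 105050, 14521, 59822, 42871, 32542)

# Same break points as in the question: 10000, 20000, ..., 70000
breaks <- seq(10000, 70000, by = 10000)

# Labels with the default dig.lab (3) and with a larger value
default_labels <- levels(cut(amount, breaks))
full_labels <- levels(cut(amount, breaks, dig.lab = 10))

print(default_labels[1])
print(full_labels[1])

# Amounts at or below 10000, or above 70000, fall outside all intervals
n_missing <- sum(is.na(cut(amount, breaks, dig.lab = 10)))
print(n_missing)
```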
Upvotes: 0