Reputation: 615
I have this column in a dataframe, df:
10
12
75
89
25
100
67
89
4
67
120.2
140.5
170.5
78.1
I am trying to
Basically, I am trying to create an output like this (These values are rough estimates, not based on actual percentiles)
10 2
12 2
75 4
89 5
25 3
100 6
67 4
89 6
4 1
67 5
120.2 7
140.5 8
170.5 10
78.1 5
I don't know how to create these percentiles and assign values.
So far I have only been able to do this
quantile(df, c(.1, .2,.3,.4,.5,.6,.7,.8,.9))
Could you please help?
Thanks
Upvotes: 0
Views: 3055
Reputation: 726
Or you could use the data.table
way which is much much faster:
library(data.table)
d = data.table(V1 = c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1))
d[, percentile:= findInterval(V1, quantile(V1, seq(0,1, by=.1), na.rm=T))]
Upvotes: 0
Reputation: 8146
Assuming your values are in x
, I would do something like
floor((x-min(x)) / (max(x)-min(x)) *10)
The fraction (x-min(x)) / (max(x)-min(x))
says how far into the range we are. The denominator (max(x)-min(x))
gets the range, in this case 170.5 - 4 = 166.5
. The numerator x-min(x)
calculates how far into the range we are.
We then multiply by 10
and round down to get you a number between 0
and 10
.
> floor((x-min(x)) / (max(x)-min(x))*10)
[1] 0 0 4 5 1 5 3 5 0 3 6 8 10 4
This gives different results to what you gave in your example, but with the lowest number being 4
, then for example 10
clearly falls into the 0
to 10%
category not the 20
to 30%
category.
Edit: if you're interested in the order of the values, you can do
floor(rank(x) / length(x) * 10)
rank
gives
2.0 3.0 7.0 9.5 4.0 11.0 5.5 9.5 1.0 5.5 12.0 13.0 14.0 8.0
and then you divide by the length (14
) and similarly to the above multiply by 10
and use floor
> floor(rank(x) / length(x) * 10)
[1] 1 2 5 6 2 7 3 6 0 3 8 9 10 5
Upvotes: 1
Reputation: 44614
This is one option:
d <- read.table(text="10
12
75
89
25
100
67
89
4
67
120.2
140.5
170.5
78.1")
transform(d, percentile=findInterval(V1, quantile(V1, seq(0,1, by=.1))))
# HEAD:
#
# V1 percentile
# 1 10 1
# 2 12 2
# 3 75 5
# 4 89 7
# 5 25 3
#
# --- 4 rows omitted ---
#
# TAIL:
# V1 percentile
# 10 67.0 4
# 11 120.2 9
# 12 140.5 10
# 13 170.5 11
# 14 78.1 6
Upvotes: 3
Reputation: 206536
It's somewhat unclear how you were calculating your breaks based on your example, but from your description, this should work
x<-c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1)
qq<-cut(x, breaks=seq(min(x),max(x), length.out=11), include.lowest=T, labels=F)
Here we take the vector and split it into 10 groups with seq(min(x),max(x), length.out=11)
(thereby using the range of observed values) so the groups are
1: 4-26.65
2: 20.65-37.30
3: 37.30-53.95
....
10: 170.50-153.85
And then we just use the cut
command with those breaks to create the new IDs to get
# cbind(x, qq)
x qq
[1,] 10.0 1
[2,] 12.0 1
[3,] 75.0 5
[4,] 89.0 6
[5,] 25.0 2
[6,] 100.0 6
[7,] 67.0 4
[8,] 89.0 6
[9,] 4.0 1
[10,] 67.0 4
[11,] 120.2 7
[12,] 140.5 9
[13,] 170.5 10
[14,] 78.1 5
If you actually wanted to use quantiles, you could use
qq<-cut(x, breaks=quantile(x, seq(0,1, length.out=11)), include.lowest=T, labels=F)
which would give
x qq
[1,] 10.0 1
[2,] 12.0 2
[3,] 75.0 5
[4,] 89.0 7
[5,] 25.0 3
[6,] 100.0 8
[7,] 67.0 4
[8,] 89.0 7
[9,] 4.0 1
[10,] 67.0 4
[11,] 120.2 9
[12,] 140.5 10
[13,] 170.5 10
[14,] 78.1 6
Upvotes: 0