Reputation: 1048
I have a data frame of agents and the number of products each one sold:
Gent_Code number_policies
A096 3
A0828 12
A0843 2
A0141 2
B079 7
B05 3
M012 5
P010 2
S039 3
I want to calculate the percentile at which each value (xi) lies, such that p% of the values in the data are below xi. The minimum percentile would be 0 and the maximum would be very close to 1, but not 1.
This is what I have tried:
ag_df <- mutate(ag_df, pon_percentiles = ecdf(ag_df$pon)(ag_df$pon))
summary(ag_df$pon_percentiles )
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.4805 0.4805 0.6417 0.6356 0.7738 1.0000
However, I want the percentile to count the values strictly below a value, not the values below or equal to it.
Hence, the minimum value in the vector should get a percentile of 0, and the maximum value should get a percentile close to 1 but not exactly 1.
Current output:
0.6666667 1.0000000 0.3333333 0.3333333 0.8888889 0.6666667 0.7777778 0.3333333 0.6666667
In the output above, the minimum of number_policies (2) gets 0.3333, but I would like it to be 0. The maximum, 12, should not get 1 but something like 0.99.
How do I do this in R? I have looked for relevant arguments in functions like ecdf and cume_dist but could not find any. Can someone please help me with this?
Upvotes: 0
Views: 976
Reputation: 767
One solution using the percent_rank() function would be:
pkgs <- c("tidyverse", "stringi")
invisible(lapply(pkgs, require, character.only = TRUE))
set.seed(2)
n <- 30
db <- tibble(gent_code = paste0(stri_rand_strings(n, 1, '[A-Z]'),
                                stri_rand_strings(n, 4, '[0-9]')),
             nr_pol = sample(1L:100L, n, TRUE))
db %>%
  mutate(percentile = percent_rank(nr_pol)) %>%
  print(n = n)
which gives the output:
gent_code nr_pol percentile
<chr> <int> <dbl>
1 E0188 35 0.241
2 S5682 91 0.862
3 O6192 96 0.931
4 E1197 97 1.000
5 Y9358 39 0.345
6 Y0069 63 0.552
7 D2879 14 0.138
8 V6778 25 0.172
9 M6284 75 0.759
10 O3420 69 0.690
11 O2301 35 0.241
12 G1728 3 0.0345
13 T4536 38 0.310
14 E0418 1 0
15 K9373 44 0.414
16 W9335 66 0.621
17 Z4140 58 0.448
18 F1424 62 0.517
19 L9825 96 0.931
20 B8411 59 0.483
21 R0735 41 0.379
22 K8881 81 0.793
23 V9502 87 0.828
24 D9827 5 0.0690
25 J5363 8 0.103
26 M2909 68 0.655
27 D3658 94 0.897
28 J1312 34 0.207
29 Z6347 63 0.552
30 D6342 72 0.724
As you can see, it starts at 0 as you want, but the highest percentile will equal 1, because it corresponds to the highest number of policies in your data.
EDIT: Forcing 12 to equal, say, the 99th percentile would imply that the data contain points higher than 12. It is equal to 1 because all of your data points are less than or equal to this value.
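If what is wanted is the share of observations strictly below each value, one option is to compute it from ranks in base R. This is a minimal sketch using the nine values from the question; the names strict_below and strict_check are only illustrative. The minimum gets 0 and the maximum gets (n - 1)/n, which stays below 1 but is only close to 1 when n is large.

```r
# Data from the question
x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)

# rank(..., ties.method = "min") - 1 counts the values strictly smaller than x[i];
# dividing by n turns that count into a proportion
strict_below <- (rank(x, ties.method = "min") - 1) / length(x)

# The same quantity written directly from the definition, as a cross-check
strict_check <- sapply(x, function(v) mean(x < v))

print(round(strict_below, 3))
```

Compared with percent_rank(), the only difference is the denominator: n here instead of n - 1, which is why the maximum no longer reaches 1.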
Upvotes: 2
Reputation: 133
x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)
The idea has two parts. (1) For the minimum value 2 to sit at the 0th percentile, remove the minimum values from the vector. (2) For the maximum value 12 to sit at the 99th percentile, append one value larger than the maximum, then pad the vector with copies of the maximum until its length is 100.
x1 <- c(x[x > min(x)], Inf)
x2 <- c(x1, rep(max(x), 100 - length(x1)))
ecdf(x2)(x)
[1] 0.03 0.99 0.00 0.00 0.05 0.03 0.04 0.00 0.03
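The same steps can be wrapped in a small helper. This is only a sketch: the function name pad_percentile and its size argument are made up for illustration, and it assumes the vector (after dropping the minimum) is shorter than size.

```r
# Hypothetical helper reproducing the padding trick above
pad_percentile <- function(x, size = 100) {
  # Drop the minimum values and add one value above the maximum
  kept <- c(x[x > min(x)], Inf)
  stopifnot(length(kept) <= size)
  # Pad with copies of the maximum until the vector has `size` elements
  padded <- c(kept, rep(max(x), size - length(kept)))
  # ecdf() counts values less than or equal to each x
  ecdf(padded)(x)
}

x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)
print(pad_percentile(x))
```

Note that the padding dominates the result: every value between the minimum and the maximum is squeezed near 0 (7 maps to 0.05 here), so this only pins the two endpoints.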
Upvotes: 0
Reputation: 115
I think this is what you want, but I'm not sure. You just have to set up the labels and probs the way you would like them.
iris2 <- iris
iris2$quartile_number <- cut(iris$Sepal.Length,
                             quantile(iris$Sepal.Length),
                             include.lowest = TRUE,
                             labels = c(.25, .5, .75, 1))
head(iris2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species quartile_number
1 5.1 3.5 1.4 0.2 setosa 0.25
2 4.9 3.0 1.4 0.2 setosa 0.25
3 4.7 3.2 1.3 0.2 setosa 0.25
4 4.6 3.1 1.5 0.2 setosa 0.25
5 5.0 3.6 1.4 0.2 setosa 0.25
6 5.4 3.9 1.7 0.4 setosa 0.5
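To make the probs explicit, a variation could look like the sketch below. It assumes the quantile breaks are distinct (true for the quartiles of iris$Sepal.Length; cut() fails on duplicated breaks), and the names probs, breaks, bin and upper_prob are only illustrative.

```r
# Probabilities that define the bins; change these to get other cut points
probs <- seq(0, 1, by = 0.25)
breaks <- quantile(iris$Sepal.Length, probs = probs)

# labels = FALSE returns the bin index (1, 2, ...) instead of a factor
bin <- cut(iris$Sepal.Length, breaks = breaks,
           include.lowest = TRUE, labels = FALSE)

# Map each bin index to the probability at its upper edge
upper_prob <- probs[bin + 1]
print(head(upper_prob))
```

This returns a numeric vector rather than a factor, which is easier to use in further calculations.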
Upvotes: 0
Reputation: 416
You can simply do this with the quantile function, applied to the numeric column rather than the whole data frame:
quantile(df$number_policies, probs = c(0, 0.24, 0.49, 0.74, 0.99))
Hope that helps!!!
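For reference, here is a minimal sketch of that call on the nine values from the question (the vector name x and the result name q are illustrative). It returns one threshold per requested probability, not one percentile per observation.

```r
# Data from the question
x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)

# Thresholds at the requested probabilities (default interpolation, type 7)
q <- quantile(x, probs = c(0, 0.24, 0.49, 0.74, 0.99))
print(q)
```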
Upvotes: 0