Reputation: 2533
I have the following toy data set (the actual data set is ~500,000 records):
library(data.table)
dt <- data.table(Address = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold", "Copper", "Gold", "Bronze"),
Name = c("Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1"),
AvgValue = c(0, 0.5, 1.25, 0.75, 1.5, 0.7, 0.41, 0.83, 2.58),
Samples = c(123, 233, 504, 3, 94, 50, 401, 402, 12))
I want to do the following:
a) subset the data so that we only consider "Gold" records" AND values in the "Value" column greater than zero
b) with the filtered data in "a" above, print out percentile and other descriptive stats.
The code to perform "a" and "b" above is as follows:
qs = dt[AvgValue > 0 & Address %like% 'Gold',
.(Samples = sum(Samples),
'25th' = quantile(AvgValue, probs = c(0.25)),
'50th' = quantile(AvgValue, probs = c(0.50)),
'75th' = quantile(AvgValue, probs = c(0.75)),
'95th' = quantile(AvgValue, probs = c(0.95)),
'99th' = quantile(AvgValue, probs = c(0.99)),
'99.9th' = quantile(AvgValue, probs = c(0.999)),
'99.99th' = quantile(AvgValue, probs = c(0.9999)),
'Mean' = mean(AvgValue),
'Median' = median(AvgValue),
'StdDev' = sd(AvgValue)),
by = .(Name, Address)]
setkey(qs, 'Name')
Printing qs
shows:
Name Address Samples 25th 50th 75th 95th 99th 99.9th 99.99th Mean Median StdDev
Stat1 Gold 779 0.65 0.765 0.9975 1.3995 1.4799 1.49799 1.499799 0.8825 0.765 0.4334647
So far, so good. These values from the (small) toy data set seem to tie out to the output from the PERCENTILE() function in MS Excel.
EDIT: Here's the problem: when I apply this R code to the larger data set, the values output by R do not tie out to the values output by the PERCENTILE() function in Excel. In the lower percentiles, the values are slightly different. In the upper percentiles, the values are significantly different. Here are the differences:
25th 50th 75th 95th 99th 99.9th 99.99th
R 0.414442227 0.428557466 0.45030771 1.668065665 42.7787092 146.9633133 349.6416913
Excel 0.414774203 0.429350073 0.448245768 0.971100779 13.31231723 98.75342572 188.2700879
And here are 20 actual data points (out of a total of 11,283 "Gold" rows). These are sorted descending:
AvgValue
349.1436739
190.189758
175.2157327
158.6492516
132.9550737
132.2686941
126.570912
122.9771829
107.6942185
99.98552912
98.93274272
98.75984129
98.73709105
98.30154271
98.2491005
96.97274385
96.94577839
96.9128099
96.90816688
96.82527478
The values from Excel seem "more correct" (especially the upper percentiles).
Does anyone see anything glaringly wrong with my R code?
If not, any ideas as to why the values from R are not tying out to the values from Excel?
Perhaps the "Type" argument for the Quantile() function (which I've not passed in)?
Thanks!
Upvotes: 1
Views: 1627
Reputation: 670
I am able to reproduce the Excel percentile
function by setting the type=7
in the R
quantile
function. See the output [[7]]]
from lapply
below and compare to what you get using Excel's percentile
on my toy vector, testveclog
:
set.seed(12272019)
testveclog <- rlnorm(11283, meanlog=-0.12, sdlog=3)
lapply(1:9, function(x) quantile(testveclog, prob=c(0.95, 0.99, 0.999), type=x))
#[[1]]
# 95% 99% 99.9%
# 131.0835 933.6057 6213.7963
#[[2]]
# 95% 99% 99.9%
# 131.0835 933.6057 6213.7963
#[[3]]
# 95% 99% 99.9%
# 131.0835 932.8875 6213.7963
#[[4]]
# 95% 99% 99.9%
# 131.0141 933.0096 6198.9585
#[[5]]
# 95% 99% 99.9%
# 131.1827 933.3687 6230.8209
#[[6]]
# 95% 99% 99.9%
# 131.3103 935.1852 6269.9696
#[[7]]
# 95% 99% 99.9%
# 131.0372 933.0168 6199.0109
#[[8]]
# 95% 99% 99.9%
# 131.2253 933.4860 6243.8705
#[[9]]
# 95% 99% 99.9%
# 131.2146 933.4567 6240.6081
writeClipboard(as.character(testveclog)) #copy and then paste into Excel to compare functions
Note that in more current versions of Excel, the PERCENTILE
function is deprecated in favor of PERCENTILE.EXC
, which matches the output from R
's quantile
function using type=6
Upvotes: 7