Reputation: 55
I am using the cut() function (base R) on two similar sets of data. On one set, I get the expected output, with cuts like (0.0253,0.0263]; on the other, I get the range output mentioned in the help documentation (like Range_75). I am unable to identify what is different about my data that is causing this, and would like some help figuring out what the difference is.
I have not been able to build a reproducible example, so instead here is information on my data and code:
The same line of code is used in a for loop, so both sets of data are being treated the same (temp_c is a data.frame, shown below):
temp_d<-as.numeric(temp_c[,1])
temp_c$grouping<-with(temp_c,cut(temp_d,breaks=quantile_c_temp,include.lowest=TRUE))
Here is what my temp_c data looks like. It is a data.frame and both columns are numeric. head() for the data with the expected output (which I will call data_expected):
var retention
1 0.00000000 1
2 0.02564103 0
3 0.00000000 0
4 0.00000000 1
5 0.00000000 0
6 0.21518987 1
head() for the data with the unexpected output (which I will call data_unexpected)
var retention
1 0.31578947 1
2 0.28205128 0
3 0.25000000 0
4 0.00000000 1
5 0.04166667 0
6 0.15189873 1
Here are the breaks used in the cut function for data_expected (aka quantile_c_temp):
[1] 0.000000000 0.008547009 0.010526316 0.012195122
[5] 0.013698630 0.015384615 0.016949153 0.018181818
[9] 0.019607843 0.020408163 0.021739130 0.022988506
[13] 0.024390244 0.025316456 0.026315789 0.027777778
[17] 0.029411765 0.030303030 0.032258065 0.033333333
[21] 0.034482759 0.035714286 0.037500000 0.039215686
[25] 0.040816327 0.041666667 0.043478261 0.045454545
[29] 0.047058824 0.048780488 0.050000000 0.052631579
[33] 0.054054054 0.055555556 0.058823529 0.060606061
[37] 0.062500000 0.065573770 0.068181818 0.071428571
[41] 0.073688109 0.076923077 0.078625892 0.082226461
[45] 0.084905660 0.089108911 0.091801020 0.095890411
[49] 0.100000000 0.103896104 0.108020556 0.111111111
[53] 0.117647059 0.122448980 0.127659574 0.134134819
[57] 0.142857143 0.148378041 0.156960784 0.166666667
[61] 0.185028180 0.200000000 0.238475317 0.500000000
Here are the breaks used in the cut function for data_unexpected (aka quantile_c_temp):
[1] 0.00000000 0.01936819 0.03333333 0.04347826
[5] 0.05071780 0.05802157 0.06422018 0.06896552
[9] 0.07374374 0.07692308 0.08180891 0.08571429
[13] 0.09090909 0.09382131 0.09756098 0.10000000
[17] 0.10526316 0.10810811 0.11111111 0.11538462
[21] 0.11764706 0.12244898 0.12500000 0.12820513
[25] 0.13157895 0.13422000 0.13793103 0.14167717
[29] 0.14285714 0.14583333 0.14934809 0.15254237
[33] 0.15501802 0.15789474 0.16000000 0.16363636
[37] 0.16666667 0.16850635 0.17241379 0.17543860
[41] 0.17777778 0.18181818 0.18333333 0.18750000
[45] 0.18965517 0.19230769 0.19565217 0.20000000
[49] 0.20560880 0.20833333 0.21188012 0.21428571
[53] 0.21875000 0.22222222 0.22448980 0.22825348
[57] 0.23076923 0.23529412 0.23809524 0.24137931
[61] 0.24590164 0.25000000 0.25396115 0.25862069
[65] 0.26315789 0.26732673 0.27272727 0.27536232
[69] 0.28000000 0.28571429 0.28813559 0.29411765
[73] 0.30000000 0.30434783 0.31050037 0.31578947
[77] 0.32485811 0.33333333 0.33333333 0.34545455
[81] 0.35646771 0.36363636 0.37500000 0.38461538
[85] 0.39393939 0.40740741 0.42857143 0.44444444
[89] 0.46341463 0.49573770 0.51424242 0.57142857
[93] 0.66666667 1.00000000
As far as I can tell, the cuts produced by my code and data should either both be of the (0.0253,0.0263] type or both be of the Range_75 type. Does anyone have any idea why the cut-types are different?
Edit: I ran dput(head(dat, 10)) on both data sets and got the following: data_expected:
structure(list(var = c(0, 0.0256410256410256, 0, 0, 0, 0.215189873417722,
0.027027027027027, 0, 0.0476190476190476, 0), retention = c(1,
0, 0, 1, 0, 1, 0, 1, 1, 1)), .Names = c("var", "retention"), row.names = c(NA,
10L), class = "data.frame")
data_unexpected:
structure(list(var = c(0.315789473684211, 0.282051282051282,
0.25, 0, 0.0416666666666667, 0.151898734177215, 0.378378378378378,
0, 0.0238095238095238, 0.208333333333333), retention = c(1, 0,
0, 1, 0, 1, 0, 1, 1, 1)), .Names = c("var", "retention"), row.names = c(NA,
10L), class = "data.frame")
My data has 8414 rows, and when I subsetted it down to 8411 rows, the cuts were correct, so there is something about row 8412. tail(data_unexpected):
var retention
8409 0.05069124 1
8410 0.31034483 1
8411 0.26027397 0
8412 0.32116788 1
8413 0.10059172 1
8414 0.16666667 0
Upvotes: 1
Views: 575
Reputation: 226936
The Range_* labels get invoked when cut can't create unique numeric labels properly with the specified number of digits:
‘dig.lab’ indicates the minimum number of digits [that] should be used in formatting the numbers ‘b1’, ‘b2’, .... A larger value (up to 12) will be used if needed to distinguish between any pair of endpoints: if this fails labels such as ‘"Range3"’ will be used.
Here's an example differentiating the two cases:
r1 <- 1+(1:4)*1e-15
cut(r1,r1)
## [1] <NA> Range_1 Range_2 Range_3
## Levels: Range_1 Range_2 Range_3
r2 <- 1+(1:4)*1e-3
cut(r2,r2)
## [1] <NA> (1.001,1.002] (1.002,1.003] (1.003,1.004]
## Levels: (1.001,1.002] (1.002,1.003] (1.003,1.004]
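So one of your data sets has a set of cuts (quantile_c_temp) in which at least two neighbouring values are so close together that their formatted representations are still identical after cut has tried up to 12 digits. (Note that the breaks for data_unexpected print 0.33333333 twice in a row.) You can probably increase dig.lab from its default value of 3, to something above 12, to solve the problem.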
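To check this on your own breaks, here is a minimal sketch. It uses a toy `breaks` vector as a stand-in for quantile_c_temp (I don't have your real vector), and `colliding_breaks` is a hypothetical helper written for this post, not part of base R. It formats neighbouring breaks with formatC at 12 digits, which is my reading of the comparison described in the documentation quoted above, and reports the pairs that come out identical. It then shows that a dig.lab above 12 brings the numeric labels back for the toy data.

```r
# Toy breaks standing in for quantile_c_temp (an assumption, not the real data):
# the 2nd and 3rd values differ only around the 15th-16th significant digit.
breaks <- c(0, 1/3, 1/3 + 1e-15, 0.5, 1)
x <- c(0.1, 0.4, 0.9)

# Hypothetical helper: report neighbouring breaks whose formatted values
# are identical at the given number of significant digits.
colliding_breaks <- function(breaks, digits = 12) {
  b <- sort(breaks)
  lab <- formatC(b, digits = digits, width = 1L)
  idx <- which(lab[-1L] == lab[-length(lab)])
  data.frame(lower = b[idx], upper = b[idx + 1L], gap = b[idx + 1L] - b[idx])
}

# Which pairs of breaks cannot be told apart at 12 digits?
print(colliding_breaks(breaks))

# Default dig.lab: the labels fall back to the Range_ form.
g_default <- cut(x, breaks, include.lowest = TRUE)
print(levels(g_default))

# Larger dig.lab: enough digits to separate the two close breaks.
g_wide <- cut(x, breaks, include.lowest = TRUE, dig.lab = 16)
print(levels(g_wide))
```

If `colliding_breaks(quantile_c_temp)` returns rows for your real data, those are the breaks to look at. Passing your own `labels` to cut is another way around the fallback, at the cost of choosing the label text yourself.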
Upvotes: 3