Reputation: 356
I have a dataset with > 900,000 rows with many duplicates:
> sum(duplicated(df$colB))
[1] 904515
So when I try to quantile cut into ten equally large subsets, I get an error
> df$colC <- cut(df$colB, quantile(df$colB,c(0:10)/10), labels=FALSE,
+ include.lowest=TRUE)
Error in cut.default(df$colB, quantile(df$colB, :
'breaks' are not unique
Using unique(quantile(df$colB,c(0:10)/10))
doesn't give equally sized subsets. There must be an easy way to make quantile cuts that also take the number of rows into account, not just the values in colB. Looping over the rows would probably take forever given how many there are. Any ideas?
Dummy dataset:
set.seed(10)
B <- round(runif(100, 0, 0.4), digits = 2) # gives 63 duplicates
df <- data.frame(colB = B)
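For reference, a quick way to check whether the decile breaks collide on a given colB (just a diagnostic, not a fix):
brks <- quantile(df$colB, c(0:10)/10)
any(duplicated(brks)) # TRUE is exactly the case where cut() refuses the breaks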
Upvotes: 0
Views: 804
Reputation: 160447
It might be hard to imagine, but some value in df$colB must be repeated so often that quantile
returns it for two (or more) of the breakpoints.
A contrived example:
set.seed(42)
vec <- c(rep(10, 20), sample(100, size = 80))
brks <- quantile(vec, (0:10)/10)
brks
# 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# 2.0 10.0 10.0 14.7 25.6 36.5 47.4 58.9 72.4 88.1 100.0
The cut
function requires that there be no repeated values in its breaks=
argument. It should be informative to look at just the quantiles of your data to confirm this.
One way around this is to use .bincode
, which does not enforce unique breaks.
cut(vec, brks, include.lowest = TRUE)
# Error in cut.default(vec, brks, include.lowest = TRUE) :
# 'breaks' are not unique
.bincode(vec, brks, include.lowest = TRUE)
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 7 8 4 9 4
# [26] 10 6 4 8 10 6 4 5 1 6 5 5 1 5 9 7 6 10 5 6 4 4 9 1 9
# [51] 8 10 1 7 10 9 8 1 8 1 7 9 7 4 8 7 6 1 6 9 5 8 6 10 6
# [76] 9 1 5 3 10 6 5 9 4 5 7 10 7 8 9 4 5 7 3 8 4 10 7 8 10
(Note that there is no "2" in the return values with this data: because brks[2]
is the same as brks[3]
, bin 2 is empty and is effectively skipped.)
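If you want to verify that, a quick tabulation of the bin codes should show every bin except 2:
table(.bincode(vec, brks, include.lowest = TRUE))
# no "2" appears among the names of the table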
One side-effect of this is that you don't get factor
labels by default, so you may want to build them yourself.
labels <- sprintf("(%0.01f-%0.01f]", brks[-length(brks)], brks[-1])
substr(labels[1], 1, 1) <- "["
labels
# [1] "[2.0-10.0]" "(10.0-10.0]" "(10.0-14.7]" "(14.7-25.6]"
# [5] "(25.6-36.5]" "(36.5-47.4]" "(47.4-58.9]" "(58.9-72.4]"
# [9] "(72.4-88.1]" "(100.0-100.0]"
head(labels[ .bincode(vec, brks, include.lowest = TRUE) ])
# [1] "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]"
(The %0.01f
format is the part you may want to customize, depending on how many digits you want in the labels.)
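Applied to your data, a minimal sketch (assuming df$colB as in your question) would be:
brks <- quantile(df$colB, c(0:10)/10)
df$colC <- .bincode(df$colB, brks, include.lowest = TRUE)
# note: with heavily duplicated values, some bins will be empty and their
# neighbours correspondingly larger, so the groups are not equal-sized
table(df$colC)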
Upvotes: 1
Reputation: 2767
There might be a neater solution than this, but this will do it:
# assign decile numbers 1..10 by position, then reorder them by the rank of
# colB; ties.method = 'first' breaks ties so every group gets the same size
df$colC <- ceiling((1:nrow(df)) * 10 / nrow(df))[rank(df$colB, ties.method = 'first')]
table(df$colC)
#>
#> 1 2 3 4 5 6 7 8 9 10
#> 10 10 10 10 10 10 10 10 10 10
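If you need this for group counts other than ten, the same idea wraps into a small helper (equal_groups is just a hypothetical name for illustration):
# split x into n roughly equal-sized groups, breaking ties by order of appearance
equal_groups <- function(x, n) {
  ceiling(seq_along(x) * n / length(x))[rank(x, ties.method = 'first')]
}
df$colC <- equal_groups(df$colB, 10)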
Upvotes: 1