Quantile cuts despite duplicates

Question

I have a dataset with more than 900,000 rows, many of them duplicates:

> sum(duplicated(df$colB))
[1] 904515

So when I try to cut it by quantiles into ten equally sized subsets, I get an error:

> df$colC <- cut(df$colB, quantile(df$colB,c(0:10)/10), labels=FALSE,
+                   include.lowest=TRUE)
Error in cut.default(df$colB, quantile(df$colB,  : 
  'breaks' are not unique

Using unique(quantile(df$colB,c(0:10)/10)) as the breaks doesn't give equally sized subsets. There must be an easy way to make quantile cuts that also take the number of rows into account, in addition to the values in colB. A loop would probably take forever given the number of rows. Any ideas?

Dummy dataset:

set.seed(10)
B <- round(runif(100, 0, 0.4), digits=2)  # gives 63 duplicates
df <- data.frame(colB = B)  # create df directly from B

pseudospin · Accepted Answer

There might be a neater solution than this, but this will do it:

df$colC <- ceiling((1:nrow(df))*10/nrow(df))[rank(df$colB, ties.method = 'first')]
table(df$colC)
#> 
#>  1  2  3  4  5  6  7  8  9 10 
#> 10 10 10 10 10 10 10 10 10 10
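
To make it clearer what the one-liner does, here is a step-by-step sketch of the same idea on the dummy data from the question. The names n, n_bins and r are only illustrative and not part of the original answer. Because rank() with ties.method = 'first' returns a permutation of 1..n, indexing the lookup vector by rank is the same as applying the ceiling formula to the ranks directly.

```r
# Dummy data, rebuilt as in the question
set.seed(10)
df <- data.frame(colB = round(runif(100, 0, 0.4), digits = 2))

n <- nrow(df)   # number of rows
n_bins <- 10    # illustrative name: how many groups we want

# Rank every row. ties.method = "first" breaks ties by order of
# appearance, so every row gets a distinct rank from 1 to n.
r <- rank(df$colB, ties.method = "first")

# Map rank k to group ceiling(k * n_bins / n):
# ranks 1..10 go to group 1, ranks 11..20 to group 2, and so on.
df$colC <- ceiling(r * n_bins / n)

table(df$colC)
```

The price of equal group sizes is that rows with the same colB value can end up in neighbouring groups, since ties are split by position. If identical values must stay together, the groups cannot all be the same size.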
