pha
pha

Reputation: 356

Quantile cuts despite duplicates

I have a dataset with > 900,000 rows with many duplicates:

> sum(duplicated(df$colB))
[1] 904515

So when I try to quantile cut into ten equally large subsets, I get an error

> df$colC <- cut(df$colB, quantile(df$colB,c(0:10)/10), labels=FALSE,
+                   include.lowest=TRUE)
Error in cut.default(df$colB, quantile(df$colB,  : 
  'breaks' are not unique

Using unique(quantile(df$colB,c(0:10)/10)) doesn't give equally sized subsets. There must be an easy solution to make quantile cuts which also considers the number of rows, in addition to the values in colB. Starting a loop sequence would probably take forever as I have a high number of rows. Any ideas?

Dummy dataset:

set.seed(10)
B <- round(runif(100, 0, 0.4), digits=2)  # gives 63 duplicates
df$colB <- B
df <- as.data.frame(df)

Upvotes: 0

Views: 804

Answers (2)

r2evans
r2evans

Reputation: 160447

It might be hard to imagine, but there must be a range of values in df$colB that is invariant, so quantile returns two (or more) of a single value.

A contrived example:

set.seed(42)
vec <- c(rep(10,20),sample(100,size=80,))
brks <- quantile(vec, (0:10)/10)
brks
#    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
#   2.0  10.0  10.0  14.7  25.6  36.5  47.4  58.9  72.4  88.1 100.0 

The cut function requires that there be no repeated values in its breaks= arguments. It should be informative to look at just the quantiles of your function to confirm this.

One way around this is to use .bincode, which does not enforce unique breaks.

cut(vec, brks, include.lowest = TRUE)
# Error in cut.default(vec, brks, include.lowest = TRUE) : 
#   'breaks' are not unique

.bincode(vec, brks, include.lowest = TRUE)
#   [1]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  7  8  4  9  4
#  [26] 10  6  4  8 10  6  4  5  1  6  5  5  1  5  9  7  6 10  5  6  4  4  9  1  9
#  [51]  8 10  1  7 10  9  8  1  8  1  7  9  7  4  8  7  6  1  6  9  5  8  6 10  6
#  [76]  9  1  5  3 10  6  5  9  4  5  7 10  7  8  9  4  5  7  3  8  4 10  7  8 10

(Note that there is no "2" in the return values with this data, because brks[2] is the same as brks[3], so appears to be ignored.)

One side-effect of this is that you don't get the factor labels by default, which might be useful.

labels <- sprintf("(%0.01f-%0.01f]", brks[-10], brks[-1])
substr(labels[1], 1, 1) <- "["
labels
#  [1] "[2.0-10.0]"    "(10.0-10.0]"   "(10.0-14.7]"   "(14.7-25.6]"  
#  [5] "(25.6-36.5]"   "(36.5-47.4]"   "(47.4-58.9]"   "(58.9-72.4]"  
#  [9] "(72.4-88.1]"   "(100.0-100.0]"

head(labels[ .bincode(vec, brks, include.lowest = TRUE) ])
# [1] "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]"

(Where the use of %0.01f is where you may want to customize this assumption.)

Upvotes: 1

pseudospin
pseudospin

Reputation: 2767

There might be a neater solution than this, but this will do it:

df$colC <- ceiling((1:nrow(df))*10/nrow(df))[rank(df$colB, ties.method = 'first')]
table(df$colC)
#> 
#>  1  2  3  4  5  6  7  8  9 10 
#> 10 10 10 10 10 10 10 10 10 10

Upvotes: 1

Related Questions