watchtower
watchtower

Reputation: 4298

Difference between ntile and cut and then quantile() function in R

I found two threads on this topic for calculating deciles in R. However, both the methods i.e. dplyr::ntile and quantile() yield different output. In fact, dplyr::ntile() fails to output proper deciles.

Method 1: Using ntile() From R: splitting dataset into quartiles/deciles. What is the right method? thread, we could use ntile().

Here's my code:

vector<-c(0.0242034679584454, 0.0240411606258083, 0.00519255930109344, 
  0.00948031338483081, 0.000549450549450549, 0.085972850678733, 
  0.00231687756193192, NA, 0.1131625967838, 0.00539244534707915, 
  0.0604885614579294, 0.0352030947775629, 0.00935626135385923, 
  0.401201201201201, 0.0208212839791787, NA, 0.0462887301644538, 
  0.0224952741020794, NA, NA, 0.000984952654008562)

ntile(vector,10)

The output is:

ntile(vector,10)
5  5  2  3  1  7  1 NA  8  2  7  6  3  8  4 NA  6  4 NA NA  1

If we analyze this, we see that there is no 10th quantile!

Method 2: using quantile() Now, let's use the method from How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame thread.

Here's my code:

as.numeric(cut(vector, breaks=quantile(vector, probs=seq(0,1, length  = 11), na.rm=TRUE),include.lowest=TRUE))

The output is:

 7  6  2  4  1  9  2 NA 10  3  9  7  4 10  5 NA  8  5 NA NA  1

As we can see, the outputs are completely different. What am I missing here? I'd appreciate any thoughts.

Is this a bug in ntile() function?

Upvotes: 20

Views: 28295

Answers (2)

Ann
Ann

Reputation: 125

ntile now ignores all NA together so they are not taken into account when you use ntile. The same code now yields different result:

vector<-c(0.0242034679584454, 0.0240411606258083, 0.00519255930109344, 
  0.00948031338483081, 0.000549450549450549, 0.085972850678733, 
  0.00231687756193192, NA, 0.1131625967838, 0.00539244534707915, 
  0.0604885614579294, 0.0352030947775629, 0.00935626135385923, 
  0.401201201201201, 0.0208212839791787, NA, 0.0462887301644538, 
  0.0224952741020794, NA, NA, 0.000984952654008562)

ntile(vector,10)

 6  5  2  4  1  8  2 NA  9  3  7  6  3 10  4 NA  7  5 NA NA  1

Upvotes: 0

DAVL
DAVL

Reputation: 896

In dplyr::ntile NA is always last (highest rank), and that is why you don't see the 10th decile in this case. If you want the deciles not to consider NAs, you can define a function like the one here which I use next:

ntile_na <- function(x,n)
{
  notna <- !is.na(x)
  out <- rep(NA_real_,length(x))
  out[notna] <- ntile(x[notna],n)
  return(out)
}

ntile_na(vector, 10)
# [1]  6  6  2  4  1  9  2 NA  9  3  8  7  3 10  5 NA  8  5 NA NA  1

Also, quantile has 9 ways of computing quantiles, you are using the default, which is the number 7 (you can check ?stats::quantile for the different types, and here for the discussion about them).

If you try

as.numeric(cut(vector, 
               breaks = quantile(vector, 
                                 probs = seq(0, 1, length = 11), 
                                 na.rm = TRUE,
                                 type = 2),
               include.lowest = TRUE))
# [1]  6  6  2  4  1  9  2 NA  9  3  8  7  3 10  5 NA  8  5 NA NA  1

you have the same result as the one using ntile.

In summary: it is not a bug, it is just the different ways they are implemented.

Upvotes: 31

Related Questions