Danny

Reputation: 488

Difference between dplyr::ntile and statar::xtile

My understanding was that dplyr::ntile and statar::xtile are trying to do the same thing. But sometimes the output is different:

dplyr::ntile(1:10, 5)
# [1] 1 1 2 2 3 3 4 4 5 5

statar::xtile(1:10, 5)
# [1] 1 1 2 2 3 3 3 4 5 5

I am converting Stata code into R. statar::xtile gives the same output as the original Stata code, but I thought dplyr::ntile would be the R equivalent.

The Stata help says that xtile is used to:

Create variable containing quantile categories

And statar::xtile is obviously replicating this.

And dplyr::ntile is:

a rough rank, which breaks the input vector into n buckets.

Do these mean the same thing?

If so, why do they give different answers?

And if not, then:

  1. What is the difference?

  2. When should you use one or the other?

Upvotes: 3

Views: 1826

Answers (1)

Danny

Reputation: 488

Thanks @alistaire for pointing out that dplyr::ntile is only doing:

function (x, n) { floor((n * (row_number(x) - 1)/length(x)) + 1) }

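So it is not the same as splitting into quantile categories, which is what xtile does: ntile assigns buckets from each value's rank, while xtile cuts the data at quantile values.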
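To see what that formula does in practice, here is a minimal base-R sketch of it. The name ntile_sketch is made up for illustration and this is not dplyr's actual source; it assumes there are no NAs and uses rank(ties.method = "first") as a stand-in for dplyr::row_number().

```r
# Hypothetical helper mirroring the formula quoted above (not dplyr's source).
# Assumes x has no NAs; rank(ties.method = "first") stands in for row_number().
ntile_sketch <- function(x, n) {
  r <- rank(x, ties.method = "first")   # position of each value in sorted order
  floor(n * (r - 1) / length(x) + 1)    # map positions 1..length(x) onto buckets 1..n
}

ntile_sketch(1:10, 5)
# [1] 1 1 2 2 3 3 4 4 5 5

# With ties, buckets are filled by position, so equal values can be split up
ntile_sketch(c(1, 1, 1, 1, 2, 3), 2)
# [1] 1 1 1 2 2 2
```

In the second call the fourth 1 lands in bucket 2 even though it equals the three values in bucket 1, which is the sort of thing a quantile-based cut would not do.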

Looking at the code for statar::xtile leads to statar::pctile, and the statar documentation says:

pctile computes quantile and weighted quantile of type 2 (similarly to Stata _pctile)

Therefore an equivalent to statar::xtile in base R is:

.bincode(1:10, quantile(1:10, seq(0, 1, length.out = 5 + 1), type = 2), 
         include.lowest = TRUE)
# [1] 1 1 2 2 3 3 3 4 5 5
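If you need this in more than one place, the expression above could be wrapped in a small function. This is only a sketch: the name xtile_base is made up, it is not statar's implementation, and it assumes a numeric x with no NAs and no weights. I have not checked how it behaves when heavy ties make some quantile breaks equal.

```r
# Hypothetical wrapper around the base-R expression above (not statar's code).
# Assumes x is numeric with no NAs; weights are not supported.
xtile_base <- function(x, n) {
  probs  <- seq(0, 1, length.out = n + 1)                 # n + 1 cut points for n bins
  breaks <- quantile(x, probs, type = 2, names = FALSE)   # type 2 quantiles, as in the expression above
  .bincode(x, breaks, include.lowest = TRUE)              # bins are (lo, hi], lowest bin closed
}

# Same computation as the one-liner above, just parameterised
xtile_base(1:10, 5)
```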

Upvotes: 4
