Daniel

Reputation: 57

Does dplyr::row_number() calculate row number for each obs? If so, how?

On the tidyverse website reference, I saw two usages: mutate(mtcars, row_number() == 1L) and mtcars %>% filter(between(row_number(), 1, 10)). From these it would be straightforward to conclude that row_number() returns the row number of each observation in the data frame.

However, it has been emphasized in the documentation that the function is a window function and is similar to sortperm in other languages. As in the example:

x <- c(5, 1, 3, 2, 2, NA)
row_number(x)
# [1]  5  1  4  2  3 NA

May I ask whether this function is intended to report the row number of each observation? If it is, what is the logic behind the function call?

Thanks!

Upvotes: 2

Views: 253

Answers (1)

Julius Vainora

Reputation: 48211

As ?row_number says, row_number is equivalent to rank(ties.method = "first"). Here rank (see ?rank) returns the sample ranks of the values in a vector, and ties.method = "first" breaks ties by position, so tied values receive increasing ranks in the order in which they appear:

row_number
# function (x) 
# rank(x, ties.method = "first", na.last = "keep")
# <bytecode: 0x108538478>
# <environment: namespace:dplyr>

So,

x <- c(5, 1, 3, 2, 2, NA)
row_number(x)
# [1]  5  1  4  2  3 NA
rank(x, ties.method = "first", na.last = "keep") # I added na.last = "keep" to fully replicate row_number
# [1]  5  1  4  2  3 NA

since

sort(x)
# [1] 1 2 2 3 5

and we gave the lower rank to the first of the two 2s due to ties.method = "first".
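Since the question mentions the link to a sorting permutation, here is a minimal base-R sketch of the same ranking built from order(). It assumes that order() keeps ties in their original order and puts NA last by default; the variable names perm and ranks are only illustrative and are not part of dplyr.

```r
x <- c(5, 1, 3, 2, 2, NA)

# order(x) is the sorting permutation: the positions of x in ascending order.
# Tied values keep their original relative order, and NA is placed last.
perm <- order(x)

# Applying order() to the permutation inverts it, giving each element's rank.
ranks <- order(perm)

# Mimic na.last = "keep": missing values get NA instead of a rank.
ranks[is.na(x)] <- NA_integer_

ranks
# [1]  5  1  4  2  3 NA
```

So row_number(x) is not the sorting permutation itself (which is what order() gives) but its inverse: the position each value would take after sorting.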

Now, when we call row_number() with no argument inside filter or mutate, it does indeed seem to simply return a vector of row numbers, as can be found here.
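The two behaviours can be compared side by side in a short sketch. It assumes dplyr is installed and uses the built-in mtcars data; the column names rn and mpg_rank are made up for illustration.

```r
library(dplyr)

# No argument: the position of each row within the current data frame.
numbered <- mtcars %>% mutate(rn = row_number())
identical(numbered$rn, seq_len(nrow(mtcars)))

# On grouped data the numbering restarts at 1 within each group.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  mutate(rn = row_number()) %>%
  ungroup()

# With an argument: the rank of the values, with ties broken by position.
ranked <- mtcars %>% mutate(mpg_rank = row_number(mpg))
all.equal(ranked$mpg_rank,
          rank(mtcars$mpg, ties.method = "first", na.last = "keep"))
```

In other words, the examples from the question, such as filter(between(row_number(), 1, 10)), rely on the no-argument form, while row_number(x) is the ranking form.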

Upvotes: 3
