Reputation: 175
Consider the following toy data and computations:
library(dplyr)
df <- tibble(x = 1)
stats::sd(df$x)
dplyr::summarise(df, sd_x = sd(x))
The first calculation results in NA, whereas the second, where the calculation is done inside the dplyr function summarise, produces NaN. I would expect both calculations to give the same result. Why do they differ?
Upvotes: 9
Views: 722
Reputation: 66834
It is calling a different function. I'm not sure which function it is, but it is not the one from stats.
dplyr::summarise(df, sd_x = stats::sd(x))
# A tibble: 1 x 1
sd_x
<dbl>
1 NA
debugonce(sd) # debug to see when sd is called
Not called here:
dplyr::summarise(df, sd_x = sd(x))
# A tibble: 1 x 1
sd_x
<dbl>
1 NaN
But called here:
dplyr::summarise(df, sd_x = stats::sd(x))
debugging in: stats::sd(1)
debug: sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
na.rm = na.rm))
...
Update
It appears that sd within summarise gets calculated outside of R, as hinted at in this header file: https://github.com/tidyverse/dplyr/blob/master/inst/include/dplyr/Result/Sd.h
A number of functions seem to be redefined by dplyr. Given that var gives the same result in both cases, I think the sd behaviour is a bug.
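As a rough illustration of how the two results can diverge, here is a minimal base R sketch. It assumes the alternative implementation uses the textbook sample-variance formula directly. That is a guess for illustration only, not something confirmed from the dplyr source, and the names naive_var and naive_sd are made up for this example. With a single observation the formula divides 0 by 0, which gives NaN, whereas stats::sd returns NA for a length-one input.

```r
x <- 1
n <- length(x)

# Textbook sample variance: sum of squared deviations divided by (n - 1).
# With n = 1 this is 0 / 0, which R evaluates to NaN.
# (Assumption: this only illustrates one way a NaN can arise.)
naive_var <- sum((x - mean(x))^2) / (n - 1)
naive_sd  <- sqrt(naive_var)

is.nan(naive_sd)       # TRUE: the naive formula yields NaN
is.na(stats::sd(x))    # TRUE: stats::sd yields NA for a single value
is.nan(stats::sd(x))   # FALSE: it is NA, not NaN
```

Note that is.na() is TRUE for both NA and NaN, so is.nan() is the check that tells the two results apart.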
Upvotes: 6