Getting different results when using the sparklyr and dplyr

Question

Just now i start learning the sparklyr package using the reference sparklyr i did what was written in the document. when using the following code

delay <- flights_tbl %>% 
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect

Warning messages:
1: Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning 
2: Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning 

> delay
# A tibble: 2,961 x 4
   tailnum count  dist  delay
         
 1 N14228  111    1547  3.71 
 2 N24211  130    1330  7.70 
 3 N668DN   49.0  1028  2.62 
 4 N39463  107    1588  2.16 
 5 N516JB  288    1249 12.0  
 6 N829AS  230     228 17.2  
 7 N3ALAA   63.0  1078  3.59 
 8 N793JB  283    1529  4.72 
 9 N657JB  285    1286  5.03 
10 N53441  102    1661  0.941
# ... with 2,951 more rows

In the similar way i want apply the same operations on nycflights13::flights dataset using dplyr package

nycflights13::flights %>% 
    group_by(tailnum) %>%
    summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
     filter(count > 20, dist < 2000, !is.na(delay)) 

# A tibble: 1,319 x 4
   tailnum count  dist   delay
          
 1 N102UW     48   536   2.94 
 2 N103US     46   535 - 6.93 
 3 N105UW     45   525 - 0.267
 4 N107US     41   529 - 5.73 
 5 N108UW     60   534 - 1.25 
 6 N109UW     48   536 - 2.52 
 7 N110UW     40   535   2.80 
 8 N111US     30   536 - 0.467
 9 N11206    111  1414  12.7  
10 N112US     38   535 - 0.947
# ... with 1,309 more rows

My problem is why i am getting the different results ?

As mention in the documentation dplyr is the complete backend operations for sparklyr.

 > sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2   dplyr_0.7.4    sparklyr_0.7.0

loaded via a namespace (and not attached):
 [1] DBI_0.7            readr_1.1.1        withr_2.1.1       
 [4] nycflights13_0.2.2 rprojroot_1.3-2    lattice_0.20-35   
 [7] foreign_0.8-69     pkgconfig_2.0.1    config_0.2        
[10] utf8_1.1.3         compiler_3.4.0     stringr_1.3.0     
[13] parallel_3.4.0     xtable_1.8-2       Rcpp_0.12.15      
[16] cli_1.0.0          shiny_1.0.5        plyr_1.8.4        
[19] httr_1.3.1         tools_3.4.0        openssl_1.0       
[22] nlme_3.1-131.1     broom_0.4.3        R6_2.2.2          
[25] dbplyr_1.2.1       bindr_0.1          purrr_0.2.4       
[28] assertthat_0.2.0   curl_3.1           digest_0.6.15     
[31] mime_0.5           stringi_1.1.6      rstudioapi_0.7    
[34] reshape2_1.4.3     hms_0.4.1          backports_1.1.2   
[37] htmltools_0.3.6    grid_3.4.0         glue_1.2.0        
[40] httpuv_1.3.5       rlang_0.2.0        psych_1.7.8       
[43] magrittr_1.5       rappdirs_0.3.1     lazyeval_0.2.1    
[46] yaml_2.1.16        crayon_1.3.4       tidyr_0.8.0       
[49] pillar_1.1.0       base64enc_0.1-3    mnormt_1.5-5      
[52] jsonlite_1.5       tibble_1.4.2       Lahman_6.0-0

akrun · Accepted Answer

The key difference is that in the non-sparklyr, we are not using na.rm = TRUE in mean, therefore, those elements having NA in 'distance' or 'arr_delay' will become NA when we take the mean but in sparklyr the NA values are already removed so the argument is not needed

We can check the NA elements in 'distance' and 'arr_delay'

nycflights13::flights %>% 
       summarise_at(vars(distance, arr_delay), funs(sum(is.na(.))))
# A tibble: 1 x 2
#  distance arr_delay
#          
#1        0      9430  #### number of NAs

So, if we correct for that, then the output will be the same

res <- nycflights13::flights %>% 
    group_by(tailnum) %>%
    summarise(count = n(),
              dist = mean(distance, na.rm = TRUE),
              delay = mean(arr_delay, na.rm = TRUE)) %>%
     filter(count > 20, dist < 2000, !is.na(delay))  %>%
     arrange(tailnum)
res
# A tibble: 2,961 x 4
#   tailnum count  dist   delay
#          
# 1 N0EGMQ    371   676   9.98 
# 2 N10156    153   758  12.7  
# 3 N102UW     48   536   2.94 
# 4 N103US     46   535 - 6.93 
# 5 N104UW     47   535   1.80 
# 6 N10575    289   520  20.7  
# 7 N105UW     45   525 - 0.267
# 8 N107US     41   529 - 5.73 
# 9 N108UW     60   534 - 1.25 
#10 N109UW     48   536 - 2.52 
# ... with 2,951 more rows

Using sparklyr

library(sparklyr)
library(dplyr)
library(nycflights13)
sc <- spark_connect(master = "local")


flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

delay <- flights_tbl %>% 
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>% 
  arrange(tailnum) %>%
  collect
delay
# A tibble: 2,961 x 4
#   tailnum count  dist   delay
#          
# 1 N0EGMQ  371     676   9.98 
# 2 N10156  153     758  12.7  
# 3 N102UW   48.0   536   2.94 
# 4 N103US   46.0   535 - 6.93 
# 5 N104UW   47.0   535   1.80 
# 6 N10575  289     520  20.7  
# 7 N105UW   45.0   525 - 0.267
# 8 N107US   41.0   529 - 5.73 
# 9 N108UW   60.0   534 - 1.25 
#10 N109UW   48.0   536 - 2.52 
# ... with 2,951 more rows

Getting different results when using the sparklyr and dplyr

Answers (1)

Related Questions