Reputation: 1100
Assuming sc is an existing spark(lyr) connection, the names given in dplyr::mutate() are ignored:
iris_tbl <- sdf_copy_to(sc, iris)

iris_tbl %>%
  spark_apply(function(e) {
    library(dplyr)
    e %>% mutate(slm = median(Sepal_Length))
  })
## Source: table<sparklyr_tmp_60a41ac01b4e> [?? x 6]
## Database: spark_connection
# Sepal_Length Sepal_Width Petal_Length Petal_Width Species X6
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa 5.8
# 2 4.9 3.0 1.4 0.2 setosa 5.8
# 3 4.7 3.2 1.3 0.2 setosa 5.8
# ...
A workaround would be to provide the names using the columns argument:
iris_tbl %>%
  spark_apply(function(e) {
    library(dplyr)
    e %>% mutate(slm = median(Sepal_Length))
  }, columns = c(colnames(iris), "slm"))
## Source: table<sparklyr_tmp_60a4126692e7> [?? x 6]
## Database: spark_connection
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species slm
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa 5.8
# 2 4.9 3.0 1.4 0.2 setosa 5.8
# 3 4.7 3.2 1.3 0.2 setosa 5.8
# ...
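A variant of the same workaround (a sketch, untested against this Spark setup) is to take the names from the Spark table itself via dplyr::tbl_vars() rather than from the local iris copy. That yields the underscored Spark-side names (Sepal_Length, ...) instead of the dotted names that colnames(iris) produces, so the result's header matches the input table:

```r
library(dplyr)

# Spark-side column names (Sepal_Length, ...) plus the new column
new_cols <- c(tbl_vars(iris_tbl), "slm")

iris_tbl %>%
  spark_apply(function(e) {
    library(dplyr)
    e %>% mutate(slm = median(Sepal_Length))
  }, columns = new_cols)
```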
Is this a bug?
Here is the sessionInfo():
Oracle Distribution of R version 3.3.0 (--)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Oracle Linux Server 7.2
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 tidyr_0.7.2 dbplot_0.2.0 rlang_0.1.4
[5] anytime_0.3.0 jsonlite_1.5 magrittr_1.5 ggplot2_2.2.1
[9] DBI_0.7 dtplyr_0.0.2 dplyr_0.7.4 kudusparklyr_0.1.0
[13] sparklyr_0.7.0 data.table_1.10.4-3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 dbplyr_1.1.0 plyr_1.8.4 bindr_0.1
[5] base64enc_0.1-3 tools_3.3.0 digest_0.6.12 gtable_0.2.0
[9] tibble_1.3.4 nlme_3.1-127 lattice_0.20-33 pkgconfig_2.0.1
[13] psych_1.7.8 shiny_1.0.5 rstudioapi_0.7 yaml_2.1.16
[17] parallel_3.3.0 withr_2.1.0 httr_1.3.1 stringr_1.2.0
[21] rprojroot_1.2 grid_3.3.0 glue_1.2.0 R6_2.2.2
[25] foreign_0.8-66 purrr_0.2.4 reshape2_1.4.2 scales_0.5.0
[29] backports_1.1.1 htmltools_0.3.6 assertthat_0.2.0 mnormt_1.5-5
[33] RApiDatetime_0.0.3 colorspace_1.3-2 mime_0.5 xtable_1.8-2
[37] httpuv_1.3.5 config_0.2 stringi_1.1.6 openssl_0.9.9
[41] munsell_0.4.3 lazyeval_0.2.1 broom_0.4.3
I know it's an old R version, but that's not up to me ...
Upvotes: 4
Views: 624
Reputation: 493
That's how it's designed. The sparklyr documentation states:
By default spark_apply() derives the column names from the input Spark data frame. Use the names argument to rename or add new columns.
trees_tbl %>%
  spark_apply(
    function(e) data.frame(2.54 * e$Girth, e),
    names = c("Girth(cm)", colnames(trees))
  )
Upvotes: 0
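Applied to the iris example from the question, the same pattern could look like this (a sketch, assuming sc and iris_tbl as above; using base R inside the closure avoids having to load dplyr on every worker):

```r
iris_tbl %>%
  spark_apply(
    function(e) {
      e$slm <- median(e$Sepal_Length)  # base R, no worker-side dplyr needed
      e
    },
    columns = c(colnames(iris), "slm")
  )
```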