Can `data.table` be used safely instead of `dplyr` with `sparkR`, `sparklyr` and `rsparkling`?

Question

I am refreshing my H2O on Spark knowledge using RStudio Spark Extensions as one of the sources.

Frankly, everywhere I look, I find dplyr's way of producing even the simplest results unnecessarily convoluted, and in most cases painful if not opaque.

Here is an example taken from the site. The mtcars dataset has been copied into the Spark cluster under the name mtcars_tbl. Then, while filtering it for cars with at least 100 hp, the table (of class "tbl_spark", "tbl_sql", "tbl_lazy", "tbl") was split into training and test subsets held in a two-component list.

The list name is 'partitions' and the dplyr code to achieve it is this:

partitions <- mtcars_tbl %>%
  filter(hp >= 100) %>%
  mutate(cyl8 = cyl == 8) %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)

Note: In my opinion, H2O has a clearer, more informative way of doing this.

Then, a model is trained within the H2O platform fitting 'mpg' for various car weight and cylinder configurations.

At some point, for prediction purposes, there is a need to select (isolate) the column 'mpg' from the test subset and use it as a numeric vector.

Here is the dplyr code implemented for an action as simple as this:

mpg1 <- partitions$test %>%
  select(mpg) %>%
  collect() %>%
  `[[`("mpg")

... and here is the data.table code, clear, compact and simple, applied to the "partitions" list:

mpg2 <- as.data.frame(partitions$test)[['mpg']]

mpg3 <- as.data.table(partitions$test)[['mpg']]

Note: The code would have been even less cluttered had the two subsets been treated as data frames or data.tables from the beginning.

As for comparing the three vectors:

identical(mpg1, mpg2) && identical(mpg1, mpg3)

TRUE

isTRUE(all.equal(mpg1, mpg2)) && isTRUE(all.equal(mpg1, mpg3))

TRUE
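A caveat on three-way comparisons: identical() compares only its first two arguments (the third positional argument is num.eq), and all.equal() likewise takes a target and a current object, so a third vector passed positionally is not compared. A minimal sketch of a pairwise check follows; the helper name all_identical and the stand-in vectors are hypothetical and not part of the original code.

```r
# Stand-in vectors; in the question these come from the Spark test partition.
v1 <- c(21.0, 22.8, 18.7)
v2 <- v1
v3 <- c(21.0, 22.8, 99.9)  # deliberately different from v1

# Hypothetical helper: compare every vector against the first one.
all_identical <- function(...) {
  vecs <- list(...)
  all(vapply(vecs[-1], function(v) identical(vecs[[1]], v), logical(1)))
}

all_identical(v1, v2, v2)  # TRUE
all_identical(v1, v2, v3)  # FALSE
```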

Note: the function dplyr::collect() shown above changes the class of the intermediate table from

"tbl_spark" "tbl_sql" "tbl_lazy" "tbl"

to

"tbl_df" "tbl" "data.frame"

which is subsequently turned into a numeric vector in the last step, namely `[[`("mpg").

Well, seemingly there are a number of superfluous steps in the dplyr code above. And this is just a simple case!

I wonder if dplyr could be safely circumvented for operations that take place within R; hence my question in the title.

Note: I know that one option is an SQL query; is there any other (better) way?

Thank you!

bcarlsen · Accepted Answer

Why not use pull(partitions$test, mpg)? Your approach is not the idiomatic dplyr way to do this operation, so it's no surprise that it's frustrating for you.
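To illustrate the difference, here is a small sketch that runs without a Spark connection. It assumes a local data frame (test_df, a hypothetical stand-in for partitions$test) and compares the long pipeline from the question with pull(); on the Spark table the call would be pull(partitions$test, mpg) as suggested above.

```r
library(dplyr)

# Local stand-in for partitions$test (assumption: no Spark cluster needed here).
test_df <- mtcars %>% filter(hp >= 100)

# The pipeline from the question: select, collect, then extract the column.
mpg_long <- test_df %>%
  select(mpg) %>%
  collect() %>%
  `[[`("mpg")

# The same result with a single call.
mpg_short <- pull(test_df, mpg)

identical(mpg_long, mpg_short)
```

On a remote table, pull() also takes care of bringing the column into R, so no separate collect() step is needed.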

There is no data.table interface for Spark. You can use Spark SQL if you prefer that. sparklyr just generates Spark SQL under the hood.
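As a rough illustration of that translation, the sketch below uses dbplyr's lazy_frame() with its default simulated connection instead of a real Spark cluster, so the exact SQL text will differ from what sparklyr sends to Spark. The column names mirror mtcars; the table name "df" is dbplyr's default and not something from the question.

```r
library(dplyr)
library(dbplyr)

# A lazy table backed by a simulated connection (assumption: no Spark needed).
lf <- lazy_frame(mpg = numeric(), hp = numeric(), cyl = numeric())

# The dplyr verbs are not executed; they are recorded and translated to SQL.
q <- lf %>%
  filter(hp >= 100) %>%
  select(mpg)

show_query(q)                            # prints the generated SQL
sql_text <- as.character(sql_render(q))  # the same SQL as a string
```

With a live sparklyr connection, show_query() on a Spark table should show the corresponding Spark SQL in the same way.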

You are certainly not obligated to use dplyr but I encourage you to familiarize yourself more with its syntax before discounting it completely - there are likely much more concise ways of doing the stuff you find frustrating.
