Reputation: 301
I am refreshing my H2O on Spark knowledge using RStudio Spark Extensions as one of the sources.
Frankly, everywhere I look, I find dplyr's unnecessarily convoluted way of producing even the simplest results quite painful, if not opaque, in most cases.
Here is an example taken from the site. The mtcars dataset has been copied into the Spark cluster under the name mtcars_tbl. Then, while subsetting/filtering it for cars with at least 100 hp, the table (of class "tbl_spark", "tbl_sql", "tbl_lazy", "tbl") was split into train and test subsets contained in a 2-component list.
The list is named 'partitions', and the dplyr code to achieve it is this:
partitions <- mtcars_tbl %>%
  filter(hp >= 100) %>%
  mutate(cyl8 = cyl == 8) %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)
Note: In my opinion, H2O has a clearer, more informative way of doing this.
Then, a model is trained within the H2O platform, fitting 'mpg' for various car weight and cylinder configurations.
At some point, for prediction purposes, there is a need to select (isolate) the column 'mpg' from the test subset and use it as a numeric vector.
Here is the dplyr code implemented for an action as simple as this:
mpg1 <- partitions$test %>%
select(mpg) %>%
collect() %>%
`[[`("mpg")
... and here is the data.table code, clear, compact and simple, applied to the "partitions" list:
mpg2 <- as.data.frame(partitions$test)[['mpg']]
mpg3 <- as.data.table(partitions$test)[['mpg']]
Note: The code would have been even more uncluttered had the two subsets been treated as dataframes or data.tables from the beginning.
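As for comparing the three vectors:
# identical() and all.equal() each compare two objects, so check pairwise
identical(mpg1, mpg2) && identical(mpg2, mpg3)
TRUE
isTRUE(all.equal(mpg1, mpg2)) && isTRUE(all.equal(mpg2, mpg3))
TRUE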
Note: the function dplyr::collect() shown above actually changes the class of the intermediate table from
"tbl_spark" "tbl_sql" "tbl_lazy" "tbl"
to
"tbl_df" "tbl" "data.frame"
which is subsequently turned into a numeric vector in the last step, namely `[[`("mpg").
Well, seemingly there are a number of superfluous steps in the dplyr code above. And this is just a simple case!
I wonder if dplyr could be safely circumvented for operations that take place within R; hence my question in the title.
Note: I know that one option is an SQL query; is there any other (better) way?
Thank you!
Upvotes: 3
Views: 514
Reputation: 1441
Why not use pull(partitions$test, mpg)? Your approach is not the idiomatic dplyr way to do this operation, so it's no surprise that it's frustrating for you.
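As a minimal sketch of the pull() suggestion: the Spark call is left commented out because it needs a live connection and the partitions list from the question, and the same verb is then run on the local mtcars data frame so it can be tried without Spark. The name mpg_local is only illustrative.

```r
library(dplyr)

# On the Spark table from the question (needs a live Spark connection):
# mpg1 <- pull(partitions$test, mpg)

# Same verb on the local mtcars data frame, no Spark required.
mpg_local <- mtcars %>%
  filter(hp >= 100) %>%
  pull(mpg)

# pull() hands back a plain vector rather than a one-column table.
is.numeric(mpg_local)
```

On a remote table, pull() should collect just that one column, which replaces the select() / collect() / `[[` chain in the question.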
There is no data.table interface for Spark. You can use Spark SQL if you prefer that. sparklyr just generates Spark SQL under the hood.
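To see the kind of SQL that dplyr verbs are translated into, something like the sketch below may help. It is an assumption-laden stand-in: cars_lazy is a hypothetical table built with dbplyr's lazy_frame(), which uses a simulated generic SQL backend rather than Spark, so the printed SQL only approximates what sparklyr would send.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for a remote table: lazy_frame() creates a lazy table
# on a simulated generic SQL connection (not Spark), so nothing is executed.
cars_lazy <- lazy_frame(hp = c(110, 93), mpg = c(21.0, 22.8))

query <- cars_lazy %>%
  filter(hp >= 100) %>%
  select(mpg)

# Print the SQL that the verbs above were translated into.
show_query(query)
```

With a live connection, calling show_query() on the Spark table itself should print the Spark SQL instead; the exact text depends on the installed sparklyr and dbplyr versions.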
You are certainly not obligated to use dplyr, but I encourage you to familiarize yourself more with its syntax before discounting it completely; there are likely much more concise ways of doing the stuff you find frustrating.
Upvotes: 0