Richard Redding

Reputation: 327

How to use spark_apply_bundle

I am trying to use spark_apply_bundle to limit the number of packages/data transferred to the worker nodes on a YARN-managed cluster. As mentioned here, I must pass the path of the tarball to spark_apply as the packages argument, and I must also make it available to the workers via "sparklyr.shell.files" in the Spark config.

My questions are:

Currently my unsuccessful script looks something like this:

# pick up the package bundle tarball (the first .tar file in the working directory)
bundle <- file.path(getwd(), list.files(pattern = "\\.tar$")[1])

...

config$sparklyr.shell.files <- bundle
sc <- spark_connect(master = "yarn-client", config = config)

...

spark_apply(sdf, f, packages = bundle)
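
For completeness, the tarball itself would be created beforehand with sparklyr::spark_apply_bundle(); a minimal sketch of that step (calling it with its defaults) is:

library(sparklyr)

# bundle the locally installed R packages into a .tar (written to the working
# directory by default, as far as I can tell) so the path can be reused above
bundle <- spark_apply_bundle()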

Upvotes: 3

Views: 468

Answers (1)

Richard Redding

Reputation: 327

The Spark job succeeded after copying the tarball to HDFS. Other approaches may also work (e.g. copying the file to each worker node), but this seems to be the easiest solution.

The updated script looks as follows:

bundle <- file.path(getwd(), list.files(pattern = "\\.tar$")[1])

...

hdfs_path <- "hdfs://nn.example.com/some/directory/"
hdfs_bundle <- paste0(hdfs_path, basename(bundle))
# upload the bundle to HDFS so that every YARN container can fetch it
system(paste("hdfs dfs -put", bundle, hdfs_path))
config$sparklyr.shell.files <- hdfs_bundle
sc <- spark_connect(master = "yarn-client", config = config)

...

spark_apply(sdf, f, packages = bundle)
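
To sanity-check that the upload actually reached HDFS before connecting, a quick listing of the target directory (the same illustrative path as above) can be run:

system(paste("hdfs dfs -ls", hdfs_path))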

Upvotes: 2

Related Questions