Richard Redding

Reputation: 327

How to use spark_apply_bundle

I am trying to use spark_apply_bundle to limit the number of packages/data transferred to the worker nodes on a YARN-managed cluster. As mentioned here, I must pass the path of the tarball to spark_apply as the packages argument, and I must also make it available to the workers via "sparklyr.shell.files" in the Spark config.

My questions are:

Currently my unsuccessful script looks something like this:

# pick up the package bundle tarball (the first .tar file in the working directory)
bundle <- file.path(getwd(), list.files(pattern = "\\.tar$")[1])

...

config$sparklyr.shell.files <- bundle
sc <- spark_connect(master = "yarn-client", config = config)

...

spark_apply(sdf, f, packages = bundle)
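
For completeness, the tarball itself would be created beforehand with sparklyr::spark_apply_bundle(); a minimal sketch of that step (calling it with its defaults) is:

library(sparklyr)

# bundle the locally installed R packages into a .tar (written to the working
# directory by default, as far as I can tell) so the path can be reused above
bundle <- spark_apply_bundle()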

Upvotes: 3

Views: 468

Answers (1)

Richard Redding

Reputation: 327

The Spark job succeeded after copying the tarball to HDFS. Other approaches may also work (e.g. copying the file to each worker node), but this seems to be the easiest solution.

The updated script looks as follows:

bundle <- file.path(getwd(), list.files(pattern = "\\.tar$")[1])

...

hdfs_path <- "hdfs://nn.example.com/some/directory/"
hdfs_bundle <- paste0(hdfs_path, basename(bundle))
# upload the bundle to HDFS so that every YARN container can fetch it
system(paste("hdfs dfs -put", bundle, hdfs_path))
config$sparklyr.shell.files <- hdfs_bundle
sc <- spark_connect(master = "yarn-client", config = config)

...

spark_apply(sdf, f, packages = bundle)
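
To sanity-check that the upload actually reached HDFS before connecting, a quick listing of the target directory (the same illustrative path as above) can be run:

system(paste("hdfs dfs -ls", hdfs_path))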

Upvotes: 2

Related Questions