Reputation: 5756
I'm trying to submit a .NET Spark job to Dataproc.
The command line looks like:
gcloud dataproc jobs submit spark \
--cluster=<cluster> \
--region=<region> \
--class=org.apache.spark.deploy.dotnet.DotnetRunner \
--jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
--archives=gs://bucket/dotnet-build-output.zip \
-- find
This command should run find to list the files in the current directory.
And I see only 2 files:
././microsoft-spark-2.4.x-0.11.0.jar
././microsoft-spark-2.4.x-0.11.0.jar.crc
Apparently GCP does not unpack the archive from Storage that is specified via --archives. The specified file exists, and the path was copied from the GCP UI. I also tried to run the exact assembly file from inside the archive (it does exist there), but it reasonably fails with File does not exist.
Upvotes: 8
Views: 2458
Reputation: 26548
I think the problem is that your command ran in the Spark driver, which runs on the master node because Dataproc submits jobs in client mode by default. You can change this by adding --properties spark.submit.deployMode=cluster when submitting the job.
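For instance, the command from the question with that property added would look like this (a sketch; the placeholders are the same as above):
gcloud dataproc jobs submit spark \
--cluster=<cluster> \
--region=<region> \
--class=org.apache.spark.deploy.dotnet.DotnetRunner \
--jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
--archives=gs://bucket/dotnet-build-output.zip \
--properties=spark.submit.deployMode=cluster \
-- find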
According to the usage help of the --archives flag:
--archives=[ARCHIVE,...] Comma separated list of archives to be extracted into the working directory of each executor. Must be one of the following file formats: .zip, .tar, .tar.gz, or .tgz.
The archive will be copied to both the driver and executor directories, but it will only be extracted for executors. I tested submitting a job with --archives=gs://my-bucket/foo.zip, which includes 2 files, foo.txt and deps.txt; then I could find the extracted files on the worker nodes:
my-cluster-w-0:~$ sudo ls -l /hadoop/yarn/nm-local-dir/usercache/root/filecache/40/foo.zip/
total 4
-r-x------ 1 yarn yarn 11 Jul 2 22:09 deps.txt
-r-x------ 1 yarn yarn 0 Jul 2 22:09 foo.txt
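Note that YARN extracts the archive into a directory named after the archive file, so in cluster mode a job should be able to reach the contents by relative path from its working directory, e.g. (a sketch based on the listing above):
cat foo.zip/deps.txt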
Upvotes: 2
Reputation: 5756
As @dagang mentioned, the --archives and --files parameters will not unpack the zip file on the driver instance, so that was the wrong direction.
I used this approach:
gcloud dataproc jobs submit spark \
--cluster=<cluster> \
--region=<region> \
--class=org.apache.spark.deploy.dotnet.DotnetRunner \
--jars=gs://<bucket>/microsoft-spark-2.4.x-0.11.0.jar \
-- /bin/sh -c "gsutil cp gs://<bucket>/builds/test.zip . && unzip -n test.zip && chmod +x ./Spark.Job.Test && ./Spark.Job.Test"
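For readability, the one-liner after -- is roughly equivalent to this small script (a sketch with the same bucket path and binary name as above):
#!/bin/sh
set -e                                    # stop at the first failing step, like the && chain
gsutil cp gs://<bucket>/builds/test.zip . # download the build from Storage to the driver
unzip -n test.zip                         # extract without overwriting existing files
chmod +x ./Spark.Job.Test                 # make the assembly executable
./Spark.Job.Test                          # run the job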
Upvotes: 0