dr11

Reputation: 5756

Dataproc does not unpack files passed as Archive

I'm trying to submit a .NET for Apache Spark job to Dataproc.

The command line looks like:

gcloud dataproc jobs submit spark \
    --cluster=<cluster> \
    --region=<region> \
    --class=org.apache.spark.deploy.dotnet.DotnetRunner \
    --jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
    --archives=gs://bucket/dotnet-build-output.zip \
    -- find

This command line should run the find command to show the files in the current directory.

And I see only 2 files:

././microsoft-spark-2.4.x-0.11.0.jar
././microsoft-spark-2.4.x-0.11.0.jar.crc

It seems GCP does not unpack the file from Storage specified via --archives. The specified file exists and the path was copied from the GCP UI. I also tried to run the exact assembly file from inside the archive (it does exist there), but that reasonably fails with "File does not exist".

Upvotes: 8

Views: 2458

Answers (2)

Dagang Wei

Reputation: 26548

I think the problem is that your command ran in the Spark driver, which ran on the master node, because Dataproc runs jobs in client mode by default. You can change this by adding --properties spark.submit.deployMode=cluster when submitting the job.
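For reference, the question's command with the deploy mode property added would look roughly like this (cluster, region and bucket are the same placeholders as in the question):

gcloud dataproc jobs submit spark \
    --cluster=<cluster> \
    --region=<region> \
    --class=org.apache.spark.deploy.dotnet.DotnetRunner \
    --jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
    --archives=gs://bucket/dotnet-build-output.zip \
    --properties=spark.submit.deployMode=cluster \
    -- find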

According to the usage help of the --archives flag:

 --archives=[ARCHIVE,...]
   Comma separated list of archives to be extracted into the working
   directory of each executor. Must be one of the following file formats:
   .zip, .tar, .tar.gz, or .tgz.

The archive will be copied to both the driver and executor dirs, but it will only be extracted for executors. I tested submitting a job with --archives=gs://my-bucket/foo.zip, which includes 2 files foo.txt and deps.txt, and I could then find the extracted files on the worker nodes:

my-cluster-w-0:~$ sudo ls -l /hadoop/yarn/nm-local-dir/usercache/root/filecache/40/foo.zip/

total 4
-r-x------ 1 yarn yarn 11 Jul  2 22:09 deps.txt
-r-x------ 1 yarn yarn  0 Jul  2 22:09 foo.txt

Upvotes: 2

dr11

Reputation: 5756

As @dagang mentioned, the --archives and --files parameters will not copy the zip file to the driver instance, so that was the wrong direction for me.

I used this approach instead:

gcloud dataproc jobs submit spark \
        --cluster=<cluster> \
        --region=<region> \
        --class=org.apache.spark.deploy.dotnet.DotnetRunner \
        --jars=gs://<bucket>/microsoft-spark-2.4.x-0.11.0.jar \
        -- /bin/sh -c "gsutil cp gs://<bucket>/builds/test.zip . && unzip -n test.zip && chmod +x ./Spark.Job.Test && ./Spark.Job.Test"
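
This relies on gsutil and unzip being available on the node where the driver runs, which was the case on my cluster. A slight variant of the same idea, extracting the build output into its own subdirectory (the app directory name is arbitrary) so the job's working directory stays clean:

gcloud dataproc jobs submit spark \
        --cluster=<cluster> \
        --region=<region> \
        --class=org.apache.spark.deploy.dotnet.DotnetRunner \
        --jars=gs://<bucket>/microsoft-spark-2.4.x-0.11.0.jar \
        -- /bin/sh -c "gsutil cp gs://<bucket>/builds/test.zip . && unzip -n test.zip -d app && chmod +x ./app/Spark.Job.Test && ./app/Spark.Job.Test"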

Upvotes: 0
