dr11

Reputation: 5756

Dataproc does not unpack files passed as Archive

I'm trying to submit a .NET for Apache Spark job to Dataproc.

The command line looks like:

gcloud dataproc jobs submit spark \
    --cluster=<cluster> \
    --region=<region> \
    --class=org.apache.spark.deploy.dotnet.DotnetRunner \
    --jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
    --archives=gs://bucket/dotnet-build-output.zip \
    -- find

This command line should run the find command to show the files in the current directory.

And I see only 2 files:

././microsoft-spark-2.4.x-0.11.0.jar
././microsoft-spark-2.4.x-0.11.0.jar.crc

It seems GCP does not unpack the file from Storage specified via --archives. The specified file exists and the path was copied from the GCP UI. I also tried to run the exact assembly file from inside the archive (it does exist there), but that reasonably fails with "File does not exist".

Upvotes: 8

Views: 2458

Answers (2)

Dagang Wei

Reputation: 26548

I think the problem is that your command ran in the Spark driver, which ran on the master node, because Dataproc runs jobs in client mode by default. You can change this by adding --properties spark.submit.deployMode=cluster when submitting the job.
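For reference, the question's command with the deploy mode property added would look roughly like this (cluster, region and bucket are the same placeholders as in the question):

gcloud dataproc jobs submit spark \
    --cluster=<cluster> \
    --region=<region> \
    --class=org.apache.spark.deploy.dotnet.DotnetRunner \
    --jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
    --archives=gs://bucket/dotnet-build-output.zip \
    --properties=spark.submit.deployMode=cluster \
    -- find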

According to the usage help of the --archives flag:

 --archives=[ARCHIVE,...]
   Comma separated list of archives to be extracted into the working
   directory of each executor. Must be one of the following file formats:
   .zip, .tar, .tar.gz, or .tgz.

The archive will be copied to both the driver and executor dirs, but it will only be extracted for executors. I tested submitting a job with --archives=gs://my-bucket/foo.zip, which includes 2 files foo.txt and deps.txt, and I could then find the extracted files on the worker nodes:

my-cluster-w-0:~$ sudo ls -l /hadoop/yarn/nm-local-dir/usercache/root/filecache/40/foo.zip/

total 4
-r-x------ 1 yarn yarn 11 Jul  2 22:09 deps.txt
-r-x------ 1 yarn yarn  0 Jul  2 22:09 foo.txt

Upvotes: 2

dr11

Reputation: 5756

As @dagang mentioned, the --archives and --files parameters will not copy the zip file to the driver instance, so that was the wrong direction for me.

I used this approach instead:

gcloud dataproc jobs submit spark \
        --cluster=<cluster> \
        --region=<region> \
        --class=org.apache.spark.deploy.dotnet.DotnetRunner \
        --jars=gs://<bucket>/microsoft-spark-2.4.x-0.11.0.jar \
        -- /bin/sh -c "gsutil cp gs://<bucket>/builds/test.zip . && unzip -n test.zip && chmod +x ./Spark.Job.Test && ./Spark.Job.Test"
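
This relies on gsutil and unzip being available on the node where the driver runs, which was the case on my cluster. A slight variant of the same idea, extracting the build output into its own subdirectory (the app directory name is arbitrary) so the job's working directory stays clean:

gcloud dataproc jobs submit spark \
        --cluster=<cluster> \
        --region=<region> \
        --class=org.apache.spark.deploy.dotnet.DotnetRunner \
        --jars=gs://<bucket>/microsoft-spark-2.4.x-0.11.0.jar \
        -- /bin/sh -c "gsutil cp gs://<bucket>/builds/test.zip . && unzip -n test.zip -d app && chmod +x ./app/Spark.Job.Test && ./app/Spark.Job.Test"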

Upvotes: 0
