user1871528

Reputation: 1775

How to get PySpark working on Google Cloud Dataproc cluster

I have a series of questions (sorry, Google's documentation is awful and not user-friendly):

  1. Is Dataproc the Google Cloud equivalent of Amazon EMR? I'm using this documentation to run a Spark job: https://cloud.google.com/dataproc/docs/tutorials/gcs-connector-spark-tutorial
  2. Can you SSH into the head machine and run a Spark job on the entire cluster, or do you have to use Google's gcloud dataproc jobs submit ... command?
  3. When I run a Spark job locally and try to access Google Cloud Storage, it works without a problem. When I try the same thing on Dataproc, it crashes.

I have read:

I have tried so far:

I do not know what to do next.

Upvotes: 3

Views: 3910

Answers (1)

Igor Dvorzhak

Reputation: 4457

  1. Yes, Google Cloud Dataproc is the Google Cloud equivalent of AWS EMR.

  2. Yes, you can SSH into the Dataproc master node with the gcloud compute ssh ${CLUSTER}-m command and submit Spark jobs manually, but it's recommended to use the Dataproc API and/or the gcloud command to submit jobs to the Dataproc cluster. Note that you can use the gcloud command to submit jobs to a Dataproc cluster from any machine that has gcloud installed; you don't need to do this from a Google Cloud VM such as the Dataproc master node.

  3. To access Google Cloud Storage (GCS) from a job submitted to a Dataproc cluster, you don't need to perform any configuration: Dataproc comes with the GCS connector pre-installed and already configured to access GCS (see the sketch after the submit command below).

You can submit a PySpark job to the Dataproc cluster with the following commands (note that you first need to copy your PySpark job file to GCS and reference that GCS path when submitting the Dataproc job):

gsutil cp spark.py gs://<BUCKET>/path/spark.py
gcloud dataproc jobs submit pyspark --cluster=${CLUSTER} \
    gs://<BUCKET>/path/spark.py
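
For reference, the spark.py above could be a minimal PySpark script along these lines. This is only a sketch, and the bucket and file names are placeholders, but it also illustrates point 3: gs:// paths are resolved directly by the pre-installed GCS connector, with no extra configuration in the job itself:

#!/usr/bin/env python
# Minimal sketch of spark.py; <BUCKET> and the file names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-example").getOrCreate()

# Read a text file straight from GCS; on Dataproc the pre-installed
# GCS connector resolves gs:// paths without any extra configuration.
lines = spark.read.text("gs://<BUCKET>/path/input.txt")
print("line count:", lines.count())

# Write results back to GCS the same way.
lines.write.mode("overwrite").text("gs://<BUCKET>/path/output")

spark.stop()

When you submit it with gcloud dataproc jobs submit pyspark as shown above, the driver output (including the printed count) is streamed back to your terminal.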

Upvotes: 2
