Reputation: 1775
I have a series of questions (sorry, the Google documentation is awful and not user-friendly):
Is Google Dataproc the equivalent of AWS EMR? Can I ssh into the Dataproc master node and submit Spark jobs manually, or do I have to use the gcloud dataproc jobs submit ... command?
I have read:
I have tried so far: I placed gcs-connector-hadoop2-latest.jar and my_project.json on my master and worker nodes in /etc/hadoop/conf.
I have added the following, on my master and worker nodes, to /etc/hadoop/conf/core-site.xml:
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>my_project.json</name>
  <value>full path to JSON keyfile downloaded for service account</value>
</property>
I tried running the following commands:
sudo gcloud dataproc jobs submit pyspark spark.py --cluster=${CLUSTER}
and
sudo gcloud dataproc jobs submit pyspark \
--jars /etc/hadoop/conf/gcs-connector-hadoop2-latest.jar \
spark.py --cluster=${CLUSTER}
Both commands fail with the error:
No FileSystem for scheme: gs
I do not know what to do next.
Upvotes: 3
Views: 3910
Reputation: 4457
Yes, Google Dataproc is the Google Cloud equivalent of AWS EMR.
Yes, you can ssh into the Dataproc master node with the gcloud compute ssh ${CLUSTER}-m command and submit Spark jobs manually, but it's recommended to use the Dataproc API and/or the gcloud command to submit jobs to the Dataproc cluster. Note that you can use the gcloud command to submit jobs to a Dataproc cluster from any machine that has gcloud installed; you don't need to do this from a Google Cloud VM, e.g. the Dataproc master node.
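For example, a minimal sketch of the manual approach (the zone below is a placeholder; spark-submit is the standard Spark launcher that Dataproc installs on cluster nodes):
# SSH into the master node (use your cluster's actual zone)
gcloud compute ssh ${CLUSTER}-m --zone=us-central1-a
# On the master node, launch the job directly with spark-submit
spark-submit spark.py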
To access Google Cloud Storage (GCS) from a job submitted to a Dataproc cluster you don't need to perform any configuration (Dataproc has the GCS connector pre-installed and it's already configured to access GCS).
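For instance, once you ssh onto any cluster node you can list a bucket through the Hadoop filesystem layer without touching core-site.xml (the bucket name is a placeholder):
# The pre-installed GCS connector resolves the gs:// scheme out of the box
hadoop fs -ls gs://<BUCKET>/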
You can submit a PySpark job to a Dataproc cluster with the following commands (note that you first need to copy your PySpark job file to GCS and reference that GCS path when submitting the Dataproc job):
gsutil cp spark.py gs://<BUCKET>/path/spark.py
gcloud dataproc jobs submit pyspark --cluster=${CLUSTER} \
gs://<BUCKET>/path/spark.py
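If your job needs extra Python modules or command-line arguments, the same command accepts them; for example (helpers.py, the region, and the arguments below are placeholders, and recent gcloud versions may also require the --region flag):
gcloud dataproc jobs submit pyspark --cluster=${CLUSTER} --region=<REGION> \
    --py-files=gs://<BUCKET>/path/helpers.py \
    gs://<BUCKET>/path/spark.py -- arg1 arg2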
Upvotes: 2