Alex

Reputation: 111

GCP Dataproc: Directly working with Spark over a YARN cluster

I'm trying to minimize changes in my code, so I'm wondering if there is a way to submit a Spark Streaming job from my personal PC/VM as follows:

spark-submit --class path.to.your.Class --master yarn --deploy-mode client \
    [options] <app jar> [app options]

without using the GCP SDK.

I also have to specify a directory of configuration files via HADOOP_CONF_DIR, which I was previously able to download from Ambari. Is there a way to do the same with Dataproc?

Thank you

Upvotes: 2

Views: 931

Answers (1)

Ben Sidhom

Reputation: 1588

Setting up an external machine as a YARN client node is generally difficult, and it is not a workflow that will work easily with Dataproc.

In a comment you mention that what you really want to do is:

  1. Submit a Spark job to the Dataproc cluster.
  2. Run a local script on each "batchFinish" (StreamingListener.onBatchCompleted?).
    • The script has dependencies that mean it cannot run on the Dataproc master node.

Again, configuring a client node outside of the Dataproc cluster and getting it to work with spark-submit is not going to work directly. However, you can configure your network so that the Spark driver (running within Dataproc) has access to the service/script you need to run, and then invoke it from the driver when desired.

If you run your service on a VM that has access to the network of the Dataproc cluster, then your Spark driver should be able to access the service.
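For example, here is a minimal sketch of that approach, assuming your script is wrapped in a service behind a hypothetical HTTP endpoint (http://service-vm:8080/batch-finished is a placeholder) that is reachable from the driver's network:

    import java.net.{HttpURLConnection, URL}

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Notifies an external service each time a batch finishes, so the
    // dependency-heavy script runs where its dependencies live rather
    // than on the Dataproc master.
    class BatchCompletionNotifier(serviceUrl: String) extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val conn = new URL(serviceUrl).openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        conn.setDoOutput(true)
        val payload = s"""{"batchTime": ${batch.batchInfo.batchTime.milliseconds}}"""
        conn.getOutputStream.write(payload.getBytes("UTF-8"))
        conn.getResponseCode // force the request; the response body is ignored here
        conn.disconnect()
      }
    }

    object StreamingApp {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("streaming-with-notifier"), Seconds(10))
        ssc.addStreamingListener(new BatchCompletionNotifier("http://service-vm:8080/batch-finished"))
        // ... define your DStream pipeline here ...
        ssc.start()
        ssc.awaitTermination()
      }
    }

The same hook could shell out to a local script instead, but an HTTP (or similar RPC) call from onBatchCompleted keeps the script on the machine that actually satisfies its dependencies.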

Upvotes: 1
