Abdul

Reputation: 73

How to run Hadoop utils on Dataproc cluster programmatically?

I have:

I want to run one of the Hadoop utilities (hadoop distcp) on the master node programmatically. What is the best way to do that? So far my only idea is to SSH to the master node and run the utility from there. Is there any other way to accomplish the same goal?

Upvotes: 4

Views: 764

Answers (1)

Igor Dvorzhak

Reputation: 4457

To run DistCp, you can submit a regular Hadoop MapReduce job through the Dataproc API or gcloud and specify org.apache.hadoop.tools.DistCp as the main class:

gcloud dataproc jobs submit hadoop --cluster=<CLUSTER> \
    --class=org.apache.hadoop.tools.DistCp -- <SRC> <DST>

From Python, you can use either the REST API directly or the Python client library to submit the DistCp job.
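
As a rough sketch of the client-library route, something like the following could work, assuming the google-cloud-dataproc package is installed and application default credentials are configured. The helper names build_distcp_job and submit_distcp are made up for illustration, and the project, region, cluster, and path values are placeholders you would supply yourself.

```python
def build_distcp_job(cluster_name, src, dst):
    """Build a Dataproc job payload that runs DistCp as a Hadoop job.

    Hypothetical helper: mirrors the gcloud command above, with
    --class mapped to main_class and <SRC> <DST> mapped to args.
    """
    return {
        "placement": {"cluster_name": cluster_name},
        "hadoop_job": {
            "main_class": "org.apache.hadoop.tools.DistCp",
            "args": [src, dst],
        },
    }


def submit_distcp(project_id, region, cluster_name, src, dst):
    """Submit the DistCp job and block until it finishes (hypothetical helper)."""
    # Imported here so the payload builder can be used without the library.
    from google.cloud import dataproc_v1

    # Dataproc job requests go to a regional endpoint.
    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = client.submit_job_as_operation(
        request={
            "project_id": project_id,
            "region": region,
            "job": build_distcp_job(cluster_name, src, dst),
        }
    )
    # result() waits for the job to complete and returns the finished Job.
    return operation.result()
```

Calling submit_distcp with your own project ID, region, cluster name, source, and destination should then be equivalent to the gcloud command, without needing SSH access to the master node.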

Upvotes: 4
