Reputation: 469
I am using Sqoop to import Oracle tables into HDFS.
I have around 50 tables to import, and 10-15 of them are quite large (around 50GB each).
For the very first run I want to import them as a full load, and after that I will import only incremental data.
Currently I have prepared 2 shell scripts as follows:
1. Script for the full dump (it takes a full dump daily).
2. Script for incremental data (as I have already taken the full dump, it fetches only the incremental data).
I have scheduled those 2 scripts at a particular time, say 7am.
Both scripts are running fine, but as you can see they execute only two Sqoop jobs in parallel.
I want to start 4 Sqoop jobs at a time to get more parallelism.
So how can I achieve more parallelism by executing more than 2 Sqoop jobs in parallel?
Any help regarding this would be highly appreciated.
Here is a sample of my shell scripts:
sqoop job --exec sqoop_job1
sqoop job --exec sqoop_job2
Upvotes: 0
Views: 450
Reputation: 635
Apache Oozie is an orchestration tool that can run jobs in sequence or in parallel, based on your need. If you have Apache Oozie installed, you can try it out. It has a dedicated Sqoop action, so you don't need to go via a shell script. Oozie also has the usual features of a workflow/orchestration tool, such as re-running failed actions, or stopping everything if the full load fails.
Here is an example workflow.xml that uses a fork/join to run multiple Sqoop jobs in parallel:
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.2" name="sqoop-wf-fork-example">
    <start to="sqoop-wf-fork"/>
    <fork name="sqoop-wf-fork">
        <path start="sqoop-categories"/>
        <path start="sqoop-customers"/>
    </fork>
    <action name="sqoop-categories">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <command>import --connect jdbc:mysql://localhost:3306/retail_db --username root --password cloudera --table categories --driver com.mysql.jdbc.Driver --delete-target-dir --m 1</command>
        </sqoop>
        <ok to="joinActions"/>
        <error to="fail"/>
    </action>
    <action name="sqoop-customers">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <command>import --connect jdbc:mysql://localhost:3306/retail_db --username root --password cloudera --table customers --driver com.mysql.jdbc.Driver --delete-target-dir --m 1</command>
        </sqoop>
        <ok to="joinActions"/>
        <error to="fail"/>
    </action>
    <join name="joinActions" to="end-wf"/>
    <kill name="fail">
        <message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end-wf"/>
</workflow-app>
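To submit the workflow you also need a job.properties file that supplies the ${jobTracker}, ${nameNode} and ${queueName} variables referenced above. A minimal sketch, assuming a single-node cluster with default ports (the host names and the HDFS path are placeholders you must adjust):

```
# job.properties -- placeholder values; adjust for your cluster
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
queueName=default
# HDFS directory containing workflow.xml (hypothetical path)
oozie.wf.application.path=${nameNode}/user/${user.name}/sqoop-wf-fork-example
oozie.use.system.libpath=true
```

You can then submit it with `oozie job -oozie http://localhost:11000/oozie -config job.properties -run` (adjusting the Oozie server URL for your environment).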
Upvotes: 0