Reputation: 91
How can we automate the incremental import in Sqoop?
In an incremental import, we need to supply --last-value so the import starts from that value onwards. My job, however, has to import from the RDBMS frequently, and I don't want to provide the last value manually each time.
Is there any way we can automate this process?
Upvotes: 4
Views: 9969
Reputation: 744
This can be achieved easily with a sqoop job.
1. Create a sqoop job (note that there is a space between "--" and "import"):
sqoop job --create JobName6 \
-- import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username=username \
--password-file /user/sqoop/password \
--table departments \
--target-dir /user/hive/warehouse/test.db/departments \
--split-by department_id \
--check-column department_id \
--incremental append \
--last-value 0;
2. Run the sqoop job: sqoop job --exec JobName6; then check the values in the target location in HDFS.
3. Insert some data into the source table (MySQL): INSERT INTO departments VALUES (9,'New Data1'),(10,'New Data2');
4. Run the sqoop job again: sqoop job --exec JobName6; check the values in the target location in HDFS again.
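To confirm that the job carries its state forward between runs, you can inspect the saved job definition and the files Sqoop wrote; a quick check using the job name and target directory from the example above:

# The saved job definition includes the updated last value after each successful run
sqoop job --show JobName6

# Inspect the files appended to the target directory
hdfs dfs -ls /user/hive/warehouse/test.db/departments
hdfs dfs -cat /user/hive/warehouse/test.db/departments/part-m-*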
Similarly for Hive Import
sqoop job --create JobName1 \
-- import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username=username \
--password-file /user/sqoop/password \
--table departments \
--hive-import \
--hive-table department \
--split-by department_id \
--check-column department_id \
--incremental append \
--last-value 0;
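After each execution you can verify that the newly inserted rows were appended to the Hive table, for example:

hive -e 'select * from department order by department_id'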
Upvotes: 2
Reputation: 31
You can take advantage of the built-in Sqoop metastore.
You can create a simple incremental import job with the following command:
sqoop job \
--create <> \
-- import \
--connect <> \
--username <> \
--password <> \
--table <> \
--incremental append \
--check-column <> \
--last-value 0
And start it with the --exec parameter:
sqoop job --exec <<Job Name>>
Sqoop will automatically serialize the last imported value back into the metastore after each successful incremental job.
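You can inspect that stored state at any time: --list shows the jobs saved in the metastore and --show prints a job's stored parameters, including the updated last value. For example:

sqoop job --list
sqoop job --show <<Job Name>>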
Upvotes: 3
Reputation: 2650
An alternate approach to @Durga Viswanath Gadiraju's answer.
In case you are importing the data into a Hive table, you could query the last updated value from the Hive table and pass that value to the sqoop import command. You could use a shell script or Oozie actions to achieve this.
Shell script:
lastupdatedvalue=`hive -e 'select last_value from table'` # tweak the selection query based on your logic
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --incremental append --last-value ${lastupdatedvalue}
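Put together, a runnable sketch of this approach might look like the following (the check column id and the Hive-side query are placeholders for your own schema; note that --incremental append also needs a --check-column):

#!/bin/bash
# Read the highest value already present in the Hive table (placeholder query)
lastupdatedvalue=$(hive -S -e 'select max(id) from yloc')

# Import only rows whose check column is greater than that value
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P \
  --check-column id --incremental append --last-value "${lastupdatedvalue}"

Since -P prompts for the password interactively, an unattended script would typically use --password-file instead, as in the earlier examples.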
Oozie approach:
Please find below a pseudo workflow:
<workflow-app name="sqoop-to-hive" xmlns="uri:oozie:workflow:0.4">
    <start to="hiveact"/>
    <action name="hiveact">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <script>script.sql</script>
            <capture-output/>
        </hive>
        <ok to="sqoopact"/>
        <error to="kill"/>
    </action>
    <action name="sqoopact">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --incremental append --last-value ${wf:actionData('hiveact')}</command>
        </sqoop>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
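To actually run the workflow you also need a properties file that supplies the variables it references (nameNode, jobTracker, queueName). A minimal sketch, with placeholder host names and application path:

# job.properties (host names and the application path are placeholders)
nameNode=hdfs://namenode-host:8020
jobTracker=resourcemanager-host:8032
queueName=default
oozie.wf.application.path=${nameNode}/apps/sqoop-to-hive

Submit it with: oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run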
Hope this helps.
Upvotes: 6
Reputation: 3956
One way to achieve this:
Create a log table in the database and develop the incremental import as follows:
1. Query the log table using the sqoop eval command to get the last value from the previous run
2. Run the sqoop import
3. Update the log table with the latest value using the sqoop eval command
You need to automate the process of sqoop eval, sqoop import and sqoop eval. With sqoop eval you can submit any valid query to any database you have connectivity to. Hence you can run a select query before the import to get the last value from the previous run, and an update query afterwards to record the last value from the current run in the log table.
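A minimal shell sketch of that eval/import/eval cycle, assuming a log table named sqoop_import_log with table_name and last_value columns and reusing the departments example from above (all names are placeholders for your own schema):

#!/bin/bash
CONNECT="jdbc:mysql://localhost:3306/retail_db"
AUTH="--username username --password-file /user/sqoop/password"

# 1. Read the last value recorded by the previous run
#    (sqoop eval prints an ASCII table; the grep/tail extraction below is a
#    simplification and may need adjusting for your environment)
last_value=$(sqoop eval --connect "$CONNECT" $AUTH \
  --query "SELECT last_value FROM sqoop_import_log WHERE table_name = 'departments'" \
  | grep -o '[0-9]\+' | tail -1)

# 2. Run the incremental import starting after that value
sqoop import --connect "$CONNECT" $AUTH \
  --table departments \
  --check-column department_id \
  --incremental append \
  --last-value "$last_value" \
  --target-dir /user/hive/warehouse/test.db/departments

# 3. Record the new high-water mark for the next run
sqoop eval --connect "$CONNECT" $AUTH \
  --query "UPDATE sqoop_import_log SET last_value = (SELECT MAX(department_id) FROM departments) WHERE table_name = 'departments'"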
Upvotes: 0