Reputation: 205
How to automate sqoop incremental import using sqoop job?
As far as I know, a sqoop job remembers the last value. If we create a sqoop job like
sqoop job --create myjob -- import --connect blah blah..
sqoop job --exec myjob
and automate it, the create step runs on every execution, so we will get a "job already exists" error.
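For concreteness, the pattern I mean looks like this (connection details, table, and column names below are only placeholders):

# Create the saved job once; Sqoop keeps track of --last-value in its metastore
sqoop job --create myjob \
  -- import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username etl_user --password-file /user/etl/.pw \
  --table orders \
  --target-dir /data/orders \
  --incremental append \
  --check-column order_id \
  --last-value 0

# Execute the saved job; each run resumes from the stored last value
sqoop job --exec myjob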
So, is it possible to automate sqoop incremental import using a sqoop job?
Please let me know your knowledge on this. Thanks in advance.
Upvotes: 2
Views: 1712
Reputation: 282
You can create a shell script to trigger the sqoop process. In the script, add a condition that checks whether the sqoop job already exists (using sqoop job --list); if it exists, trigger the saved job, and if not, create it and then trigger it. A sketch follows below. For scheduling you can create cron jobs, use Oozie, or use schedulers like UC4, Airflow, Autosys...
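A minimal sketch of that wrapper script, assuming the saved job is named myjob and using placeholder connection details:

#!/bin/bash
# Create the saved job only if it does not already exist, then run it
JOB_NAME="myjob"

if ! sqoop job --list | grep -qw "$JOB_NAME"; then
  sqoop job --create "$JOB_NAME" \
    -- import \
    --connect jdbc:mysql://dbhost/salesdb \
    --username etl_user --password-file /user/etl/.pw \
    --table orders \
    --target-dir /data/orders \
    --incremental append \
    --check-column order_id \
    --last-value 0
fi

# The saved job resumes from the last value stored in the Sqoop metastore
sqoop job --exec "$JOB_NAME"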
Upvotes: 0
Reputation: 401
You can simply create a sqoop job using an options file, and your job will be saved. The next time you run the job, it will automatically fetch the incremental data.
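A minimal sketch of this approach, assuming an options file laid out one option or value per line as Sqoop expects (connection details, table, and column names are placeholders):

# Write the options file (one option or value per line)
cat > import.opts <<'EOF'
--connect
jdbc:mysql://dbhost/salesdb
--username
etl_user
--table
orders
--target-dir
/data/orders
--incremental
append
--check-column
order_id
--last-value
0
EOF

# Save the job once; each --exec resumes from the stored last value
sqoop job --create myjob -- import --options-file import.opts
sqoop job --exec myjob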
Upvotes: 0
Reputation: 38315
It's better to have full control over the bookmark so that you can restate (re-load) data. That is why a custom mechanism is better than a sqoop job.
The workflow template is:
1. Read the latest bookmark. It can be stored in some RDBMS (MySQL) or in a file in Hadoop, etc. If no bookmark exists yet, it can be selected from the target table, for example select max(etl_load_timestamp) from target_table.
2. Pass the bookmark condition to the sqoop query as a parameter, for example concatenate and etl_update_timestamp > $your_bookmark to the WHERE clause (see the sketch after this list).
3. If the increment is too big (the bookmark value is too far from current_timestamp), run sqoop in a monthly/daily/hourly loop, saving the bookmark after each iteration.
4. The bookmark can be a timestamp or some run_id (source-system ETL sequence).
5. If you need to restate the data (load it retroactively), then UPDATE the bookmark and the workflow will re-load the data starting from that bookmark.
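A minimal sketch of steps 1 and 2 as a shell wrapper, assuming a timestamp bookmark kept in a local file; connection details, table, and column names are placeholders:

#!/bin/bash
set -e
BOOKMARK_FILE=/var/etl/orders.bookmark

# 1. Read the latest bookmark; fall back to a low watermark on the first run
BOOKMARK=$(cat "$BOOKMARK_FILE" 2>/dev/null || echo '1970-01-01 00:00:00')
NOW=$(date '+%Y-%m-%d %H:%M:%S')

# 2. Pass the bookmark condition into the WHERE clause of the sqoop query
sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username etl_user --password-file /user/etl/.pw \
  --query "SELECT * FROM orders WHERE etl_update_timestamp > '$BOOKMARK' AND etl_update_timestamp <= '$NOW' AND \$CONDITIONS" \
  --split-by order_id \
  --target-dir /data/orders/delta_$(date +%Y%m%d%H%M%S)

# Save the new bookmark only after the import succeeds (set -e aborts earlier on failure),
# so a failed run is simply retried from the old bookmark
echo "$NOW" > "$BOOKMARK_FILE"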
Upvotes: 1