Kavitha

Reputation: 205

How to automate sqoop incremental import using sqoop job?


As far as I know, a sqoop job remembers the last value. If we create a sqoop job like

sqoop job --create myjob -- import --connect blah blah..

sqoop job --exec myjob

and automate it, it will try to create the job on every run, so we will get a "job already exists" error.

So, is it possible to automate sqoop incremental import using a sqoop job?

Please share what you know about this. Thanks in advance.

Upvotes: 2

Views: 1712

Answers (3)

Prabhanj

Reputation: 282

You can create a shell script to trigger the sqoop process. Add a condition that checks whether the sqoop job already exists (using sqoop job --list): if it exists, trigger the saved job; if not, create it and then trigger it. For scheduling, you can use cron jobs, Oozie, or schedulers like UC4, Airflow, or Autosys.
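A minimal sketch of that check-then-run wrapper, wrapped in a function. The connection string, table, and check column are placeholders, not from the answer:

```shell
#!/bin/sh
# run_incremental: create the saved sqoop job if it does not exist
# yet, then execute it. Safe to call repeatedly from a scheduler.
run_incremental() {
    job_name="$1"

    # `sqoop job --list` prints the names of all saved jobs;
    # grep -w looks for an exact word match on our job name.
    if ! sqoop job --list | grep -qw "$job_name"; then
        sqoop job --create "$job_name" -- import \
            --connect jdbc:mysql://dbhost/mydb \
            --username etl_user \
            --table orders \
            --incremental append \
            --check-column id \
            --last-value 0
    fi

    # Executing the saved job reuses the last-value sqoop stored
    # from the previous run, so the import stays incremental.
    sqoop job --exec "$job_name"
}
```

A cron entry can then simply call `run_incremental myjob`; the first run creates the job and every later run only executes it.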

Upvotes: 0

Manish Pansari

Reputation: 401

You can simply create a sqoop job using an options file, and the job will be saved. The next time you run the job, it will automatically fetch the incremental data.
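A sketch of that setup, assuming Sqoop's generic --options-file flag; the file path, connection string, and table are placeholders, and whether --options-file can be combined with job creation this way may depend on your Sqoop version. In an options file, each flag and each value goes on its own line:

```
# import-opts.txt -- one argument per line
--connect
jdbc:mysql://dbhost/mydb
--username
etl_user
--table
orders
--incremental
append
--check-column
id
--last-value
0
```

```
sqoop job --create myjob -- import --options-file import-opts.txt
sqoop job --exec myjob
```

Because the job is saved in the sqoop metastore, each --exec picks up the stored last value and imports only the new rows.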

Upvotes: 0

leftjoin

Reputation: 38315

It's better to have full control over the bookmark so that restatement (reloading) is possible. That is why a custom mechanism is better than a sqoop job.

The workflow template is:

  1. Read the latest bookmark. It can be stored in an RDBMS (e.g. MySQL) or in a file in Hadoop, etc. If no bookmark exists yet, it can be selected from the target table, for example select max(etl_load_timestamp) from target_table.

  2. Pass the bookmark condition to the sqoop query as a parameter, for example concatenate and etl_update_timestamp > $your_bookmark to the WHERE clause.

  3. Run sqoop
  4. Save the latest bookmark (see 1.)

If the increment is too big (the bookmark value is too far from current_timestamp), run sqoop in a monthly/daily/hourly loop, saving the bookmark after each iteration.

The bookmark can be a timestamp or some run_id (a source-system ETL sequence number).

If you need to restate the data (load it retroactively), then UPDATE the bookmark and the workflow will reload data starting from that bookmark.
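The workflow above can be sketched as a small shell function. The bookmark here lives in a local file, and the connection string, table, and column names are placeholders:

```shell
#!/bin/sh
# load_increment: one iteration of the bookmark workflow.
load_increment() {
    bookmark_file="$1"

    # 1. Read the latest bookmark; fall back to an initial-load
    #    value if no bookmark has been saved yet.
    last_ts=$(cat "$bookmark_file" 2>/dev/null || echo "1970-01-01 00:00:00")

    # 2+3. Concatenate the bookmark condition into the WHERE
    #      clause and run the sqoop import.
    sqoop import \
        --connect jdbc:mysql://dbhost/mydb \
        --table orders \
        --where "etl_update_timestamp > '$last_ts'" \
        --target-dir /data/orders/increment || return 1

    # 4. Save the new bookmark only after a successful run, so a
    #    failed import is retried from the same starting point.
    date '+%Y-%m-%d %H:%M:%S' > "$bookmark_file"
}
```

To restate data, overwrite the bookmark file with an earlier timestamp and the next run reloads everything from that point; keeping the bookmark save as the last step is what makes the loop safe to rerun.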

Upvotes: 1
