Ramesh devarakonda

Reputation: 53

SQOOP Incremental Import with Lastmodified

I am trying to understand Sqoop incremental imports with the "lastmodified" option. Since HDFS is not meant for in-place file updates, how is this handled internally? Does Sqoop create a separate file and point the original to this new file? And in the case of append, does it create a new file with the new records?

But how does lastmodified mode update data that is already in HDFS? What is the logic behind it?

Upvotes: 3

Views: 2198

Answers (1)

Dev

Reputation: 13753

--incremental append mode

You are only adding new data. Each Sqoop incremental import adds new part files to the HDFS target directory (for example part-m-00000, part-m-00001); existing files are not modified.
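
A minimal sketch of an append-mode import, for illustration only. The connection string, table, column, directory and last value are hypothetical placeholders; the flags (--incremental append, --check-column, --last-value) are the ones Sqoop documents for incremental imports. The SQOOP variable is just a dry-run convenience added here so the command is printed instead of executed.

```shell
# Dry run by default: prints the command instead of running it.
# Export SQOOP=sqoop to execute it against a real cluster.
SQOOP="${SQOOP:-echo sqoop}"

# Hypothetical database, table, column and HDFS path; adjust to your setup.
append_import() {
  $SQOOP import \
    --connect jdbc:mysql://dbhost/shop \
    --table orders \
    --target-dir /data/orders \
    --incremental append \
    --check-column order_id \
    --last-value 1000
}

append_import
```

With these values, only rows whose order_id is greater than 1000 are imported, and they land in new part files next to the existing ones.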

--incremental lastmodified mode

Here the source has updates to existing rows in addition to newly added rows. If you run the same command a second time, it fails because the target directory already exists:

Error during import: --merge-key or --append is required when using --incremental lastmodified and the output directory exists.

If you add --append, Sqoop simply adds new files to the same directory, so the old and the updated version of a row end up side by side. You then have to merge the datasets manually using the Sqoop merge tool.

As per the docs,

The merge tool allows you to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key.
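
As a rough sketch of that manual route, assuming the newer rows were imported into their own directory rather than appended to the old one: the merge tool reads both datasets and writes a third one, so nothing in HDFS is modified in place. All paths, the jar, the class name and the key column below are hypothetical; the jar and class are the record class Sqoop generates for the table (for example with sqoop codegen).

```shell
# Dry run by default: prints the command instead of running it.
# Export SQOOP=sqoop to execute it against a real cluster.
SQOOP="${SQOOP:-echo sqoop}"

# Hypothetical paths and names; adjust to your setup.
manual_merge() {
  $SQOOP merge \
    --new-data /data/orders_delta \
    --onto /data/orders \
    --target-dir /data/orders_merged \
    --jar-file orders.jar \
    --class-name orders \
    --merge-key order_id
}

manual_merge
```

For each order_id, the row from /data/orders_delta wins over the row in /data/orders, and the flattened result is written to /data/orders_merged.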

Alternatively, instead of merging manually, you can pass --merge-key with the key column, and Sqoop takes care of the merge automatically.
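
A minimal sketch of a lastmodified import that merges on its own, under the same assumptions as before: the database, table, columns, path and timestamp are hypothetical, and the SQOOP variable is only a dry-run wrapper added for illustration.

```shell
# Dry run by default: prints the command instead of running it.
# Export SQOOP=sqoop to execute it against a real cluster.
SQOOP="${SQOOP:-echo sqoop}"

# Hypothetical database, table, columns and HDFS path; adjust to your setup.
# Note: the dry run prints the timestamp without its quotes.
lastmodified_import() {
  $SQOOP import \
    --connect jdbc:mysql://dbhost/shop \
    --table orders \
    --target-dir /data/orders \
    --incremental lastmodified \
    --check-column updated_at \
    --last-value '2015-01-01 00:00:00' \
    --merge-key order_id
}

lastmodified_import
```

Here updated_at is assumed to be a timestamp column that the source system sets on every insert and update, and order_id is the key used to decide which rows replace which.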

Upvotes: 2
