Reputation: 9014
I understand that Sqoop offers a couple of methods to handle incremental imports:
Append mode
lastmodified mode
Questions on Append mode:
Is append mode supported only when the check column is an integer data type? What if I want to use a date or timestamp column but still only append to the data already in HDFS?
Does this mode mean that the new data is appended to the existing HDFS file, that it picks only the new data from the source DB, or both?
Let's say that the check-column is an id column in the source table, and a row already exists where the id is 100. The Sqoop import is run in append mode with last-value 50, so it imports all rows where id > 50. Before the next run, with last-value 150, the row with id 100 is updated to id 200. Would this row also be pulled?
Example: Let's say there is a table called customers with one of its records as follows. The first column is the id.
100 abc xyz 5000
When the Sqoop job is run in append mode with last-value 50 for the id column, it would pull the above record.
Now the same record is changed, and its id changes as well (a hypothetical example), as follows:
200 abc xyz 6000
The question was: if you run the sqoop command again, would it pull the above record as well?
Questions on lastmodified mode:
It looks like running Sqoop in this mode merges the existing data with the new data using 2 MR jobs internally. Which column does Sqoop use to compare the old and new data for the merge process?
Can the user specify the column for the merge process?
Can more than one column be provided for the merge process?
Should the target-dir already exist for the merge to happen, so that Sqoop treats the existing target dir as the old dataset? Otherwise, how would Sqoop know which old dataset to merge into?
Upvotes: 0
Views: 1299
Reputation: 3956
Answers for append mode:
Yes, append mode is meant for a numeric column with increasing values, such as a row id.
Both
Question is not clear.
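If the third question is about a row whose id changes between runs, a rough way to reason about it is that an append import selects rows whose check column is greater than the last value. The snippet below is only a toy model of that filter in Python, not Sqoop code; the function name `append_import` and the column names `name`, `city` and `amount` are made up for illustration.

```python
# Toy model of an append-mode import (not Sqoop itself): keep the rows
# whose check column is strictly greater than the supplied last value.
def append_import(rows, check_column, last_value):
    return [row for row in rows if row[check_column] > last_value]


# First run: the customers table holds the record with id 100.
# Column names other than "id" are hypothetical.
table = [{"id": 100, "name": "abc", "city": "xyz", "amount": 5000}]
first_run = append_import(table, "id", 50)

# The same record is then changed in the source, including its id.
table = [{"id": 200, "name": "abc", "city": "xyz", "amount": 6000}]
second_run = append_import(table, "id", 150)

print(first_run)   # the id 100 version of the record
print(second_run)  # the id 200 version of the record
```

Under this model the changed record is picked up on the second run because 200 > 150. Since append mode only adds to what is already in HDFS, the earlier id 100 copy would presumably stay there alongside the new one.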
Answers for lastmodified mode:
Incremental load in lastmodified mode does not merge data; it is primarily for pulling updated and inserted rows based on a timestamp column.
The merge process is completely separate. Once you have both the old data and the new data, you can merge the new data onto the old data into a different directory. You can see a detailed explanation here.
The merge process works with only one field.
The target-dir should not exist. The video covers the complete merge process.
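To make the merge idea above concrete, here is a minimal Python sketch of merging a new dataset onto an old one using a single key field, where the newer record wins when the key matches. It only illustrates the concept and is not how Sqoop implements it; `merge_on_key` and the sample rows are hypothetical.

```python
# Toy model of merging new data onto old data by one key field.
# When both datasets contain the same key, the newer record replaces the older.
def merge_on_key(old_rows, new_rows, merge_key):
    merged = {row[merge_key]: row for row in old_rows}
    for row in new_rows:
        merged[row[merge_key]] = row
    return sorted(merged.values(), key=lambda row: row[merge_key])


# Hypothetical sample data: id 100 was updated, id 102 was newly inserted.
old_data = [
    {"id": 100, "amount": 5000},
    {"id": 101, "amount": 700},
]
new_data = [
    {"id": 100, "amount": 6000},
    {"id": 102, "amount": 900},
]

result = merge_on_key(old_data, new_data, "id")
for row in result:
    print(row)
```

The result goes to a separate collection rather than overwriting either input, which mirrors the point about merging into a different directory.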
Upvotes: 0