Reputation: 2083
I'm working with Sqoop incremental import, pulling data into my Hive table every day. I have the following scenario: an RDBMS table empdata with these columns:

id  name  city
1   Sid   Amsterdam
2   Bob   Delhi
3   Sun   Dubai
4   Rob   London
I am importing the data into Hive using Sqoop incremental import, through a cron job that runs a shell script to do the work.
#!/bin/bash
DATE=$(date +"%d-%m-%y")
# tables.txt holds one colon-separated entry per line: server:dbname:tablename
while IFS=":" read -r server dbname tablename; do
  sqoop import --connect "jdbc:mysql://$server/$dbname" \
    --table "$tablename" --username root --password cloudera \
    --hive-import --hive-table dynpart \
    --hive-partition-key 'thisday' --hive-partition-value "$DATE" \
    --target-dir '/user/hive/newimp5' \
    --incremental append --check-column id \
    --last-value "$(hive -e "select max(id) from $tablename")"
done < /home/cloudera/Desktop/MyScripts/tables.txt
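For reference, the read loop above implies that tables.txt contains one colon-separated entry per line in the form server:dbname:tablename; the hostnames and database names below are placeholders:

```
localhost:employees_db:empdata
db-host-2:sales_db:orders
```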
The above script works fine for the incremental load. But now I have another requirement: to check whether any previous records were updated. For example, if the record
1 Rob London is updated to 1 Rob NewYork,
I need to pick up that updated record along with the incremental import, but only the updated value should be present in the Hive table, so that I don't end up with duplicates. Could anyone tell me how I can accomplish this?
Upvotes: 1
Views: 983
Reputation: 753
In Sqoop you cannot use two columns in --check-column. Even if you could (it is possible to combine two fields in --check-column, see: Sqoop Incremental Import multiple columns in check-column), you still can't be sure whether the city will have a higher or lower value next time, so you can't really use the city field as a check column. You now have the following options:
1) In your RDBMS, add a column of type timestamp to the table that is automatically set to the current timestamp on every insert or update. Then, after the incremental append, import the table again using --incremental lastmodified with --check-column ts_field and --last-value ..., and also pass --merge-key id to the Sqoop import so it can merge the updates on the basis of id.
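A minimal sketch of option 1, assuming the timestamp column is named ts_field and using a literal last-value (in practice you would persist the timestamp of the previous run):

```
# Hypothetical sketch: re-import rows modified since the last run and
# merge them into the existing HDFS dataset on the id key.
# ts_field and the last-value shown are assumptions.
sqoop import \
  --connect "jdbc:mysql://$server/$dbname" \
  --username root --password cloudera \
  --table empdata \
  --incremental lastmodified \
  --check-column ts_field \
  --last-value "2017-01-01 00:00:00" \
  --merge-key id \
  --target-dir /user/hive/newimp5
```

With --merge-key, Sqoop runs a merge job after the import, so a row whose id already exists in the target dir is replaced by its updated version instead of being appended as a duplicate.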
2) a) First run your Sqoop import with --check-column id, --incremental append, and the last value. b) Then run the Sqoop import again without --incremental, with the target dir set to some temporary folder. c) Then use sqoop merge to merge the two datasets (the target dirs from steps a and b), where the new data is in the target dir of step a and is merged onto the target dir of step b, with --merge-key id.
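The three steps above could look roughly like this; the HDFS paths are placeholders, and sqoop merge additionally needs the record class generated by sqoop codegen for the table:

```
# a) incremental append of new ids (last-value shown is an assumption)
sqoop import --connect "jdbc:mysql://$server/$dbname" \
  --username root --password cloudera --table empdata \
  --incremental append --check-column id --last-value 4 \
  --target-dir /user/hive/emp_new

# b) full re-import (no --incremental) into a temporary dir
sqoop import --connect "jdbc:mysql://$server/$dbname" \
  --username root --password cloudera --table empdata \
  --target-dir /user/hive/emp_full_tmp

# c) merge the two datasets on id, keeping one record per key;
#    --jar-file/--class-name come from "sqoop codegen --table empdata ..."
sqoop merge --new-data /user/hive/emp_new \
  --onto /user/hive/emp_full_tmp \
  --target-dir /user/hive/emp_merged \
  --merge-key id \
  --jar-file /tmp/empdata.jar --class-name empdata
```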
Please let me know if you have any further questions.
Upvotes: 2