Reputation: 951
I have a lot of sqoop jobs running in AWS EMR, but sometimes i need to turn off this instance.
There's a way to save the last id from incremental import, maybe localy and upload it to s3 via cronjob.
My first idea is, when i create the job i just send a request to Redshift, where my data is stored and get the last id or last_modified, via bash script.
Another idea is to get the output of sqoop job --show $jobid, filter the parameter of last_id and using it to create the job again.
But i don't know if sqoop offer a way to do this more easily.
Upvotes: 0
Views: 423
Reputation: 951
I change the file sqoop-site.xml and add the endpoint to my MySQL.
Create the MySQL instance and run this queries:
CREATE TABLE SQOOP_ROOT (version INT, propname VARCHAR(128) NOT NULL, propval VARCHAR(256), CONSTRAINT SQOOP_ROOT_unq UNIQUE (version, propname));
and INSERT INTO SQOOP_ROOT VALUES(NULL, 'sqoop.hsqldb.job.storage.version', '0');
Change the original sqoop-site.xml adding your MySQL endpoint, user and password.
<property>
<name>sqoop.metastore.client.enable.autoconnect</name>
<value>true</value>
<description>If true, Sqoop will connect to a local metastore
for job management when no other metastore arguments are
provided.
</description>
</property>
<!--
The auto-connect metastore is stored in ~/.sqoop/. Uncomment
these next arguments to control the auto-connect process with
greater precision.
-->
<property>
<name>sqoop.metastore.client.autoconnect.url</name>
<value>jdbc:mysql://your-mysql-instance-endpoint:3306/database</value>
<description>The connect string to use when connecting to a
job-management metastore. If unspecified, uses ~/.sqoop/.
You can specify a different path here.
</description>
</property>
<property>
<name>sqoop.metastore.client.autoconnect.username</name>
<value>${sqoop-user}</value>
<description>The username to bind to the metastore.
</description>
</property>
<property>
<name>sqoop.metastore.client.autoconnect.password</name>
<value>${sqoop-pass}</value>
<description>The password to bind to the metastore.
</description>
</property>
When you execute the command sqoop job --list
in first time it will return zero values. But after creating the jobs, if you shutdown the EMR, you don't lose the sqoop metadata from executing jobs.
In EMR, we can use the Bootstrap Action to automate this operation in cluster creation.
Upvotes: 1
Reputation: 13773
As per the Sqoop docs,
If an incremental import is run from the command line, the value which should be specified as --last-value in a subsequent incremental import will be printed to the screen for your reference. If an incremental import is run from a saved job, this value will be retained in the saved job. Subsequent runs of sqoop job --exec someIncrementalJob will continue to import only newer rows than those previously imported.
So, you need to store nothing. Sqoop's metastore will take care of saving last value and avail for next incremental import job.
Example,
sqoop job \
--create new_job \
-- \
import \
--connect jdbc:mysql://localhost/testdb \
--username xxxx \
--password xxxx \
--table employee \
--incremental append \
--check-column id \
--last-value 0
And start this job with the --exec
parameter:
sqoop job --exec new_job
Upvotes: 2