Tuomas Tikka

Reputation: 203

Optimizing Sqoop data import from MySQL to Hive using import-all-tables

I am using Sqoop 1.4.6 to import data from MySQL to Hive using the import-all-tables option. The result is ok, but the import process itself is quite slow. For example, one of the databases contains 40-50 tables with well under 1 million rows in total, yet the import takes around 25-30 minutes to complete. Upon investigation, it seems most of the time is spent initialising Hive for each imported table; a plain mysqldump of the same database completes in under 1 minute. So the question is: how can this initialisation time be reduced, if that is indeed the bottleneck, for example by using a single Hive session?

The import command is:

sqoop import-all-tables -Dorg.apache.sqoop.splitter.allow_text_splitter=true --compress --compression-codec=snappy --num-mappers 1 --connect "jdbc:mysql://..." --username ... --password ... --null-string '\\N' --null-non-string '\\N' --hive-drop-import-delims --hive-import --hive-overwrite --hive-database ... --as-textfile --exclude-tables ... --warehouse-dir=...

Update:

Sqoop version: 1.4.6.2.5.3.0-37

Hive version: 1.2.1000.2.5.3.0-37

Could be related to:

https://issues.apache.org/jira/browse/HIVE-10319

Upvotes: 1

Views: 3103

Answers (1)

Ronak Patel

Reputation: 3849

Remove the option --num-mappers 1 to run the import with the default 4 mappers, OR change it to some higher number, e.g. --num-mappers 8 (if the hardware allows) - this runs the import with more parallel jobs for tables that have a primary key. AND use the --autoreset-to-one-mapper option - it will use 1 mapper for tables that do not have a primary key. Also use --direct mode:

sqoop import-all-tables \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect "jdbc:mysql://..." --username ... \
--password ... \
--compress --compression-codec=snappy \
--num-mappers 8 \
--autoreset-to-one-mapper \
--direct \
--null-string '\\N' \
...

Let us know if this improves the performance...


Update:

--fetch-size=<n> - where <n> represents the number of entries that Sqoop must fetch at a time. The default is 1000.

Increase the value of the fetch-size argument based on the volume of data that needs to be read. Set the value according to the available memory and bandwidth.
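For illustration, a larger fetch size can be passed on the same import command; this is only a sketch, where 10000 is an example value and the trailing ... stands for the remaining options from the question:

sqoop import-all-tables \
--connect "jdbc:mysql://..." --username ... --password ... \
--fetch-size=10000 \
...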


Increasing mapper memory from the current value to some higher number can also help, for example:

sqoop import-all-tables -D mapreduce.map.memory.mb=2048 -D mapreduce.map.java.opts=-Xmx1024m <sqoop options>


Sqoop Performance Tuning Best Practices

Tune the following Sqoop arguments in the JDBC connection or Sqoop mapping to optimize performance (a combined sketch follows the list):

  • batch (for export)
  • split-by and boundary-query (not needed here since we are using --autoreset-to-one-mapper, and they can't be used with import-all-tables)
  • direct
  • fetch-size
  • num-mappers
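Putting these together, a tuned version of the import might look like the sketch below; the mapper count, memory settings and fetch size are illustrative values to adjust for your cluster, and the trailing ... stands for the remaining Hive and output options from the question:

sqoop import-all-tables \
-D mapreduce.map.memory.mb=2048 \
-D mapreduce.map.java.opts=-Xmx1024m \
--connect "jdbc:mysql://..." --username ... --password ... \
--num-mappers 8 \
--autoreset-to-one-mapper \
--direct \
--fetch-size=10000 \
...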

Upvotes: 1
