Reputation: 203
I am using Sqoop 1.4.6 to import data from MySQL to Hive using the import-all-tables option. The result is ok, but the import process itself is quite slow. For example, one of the databases contains 40-50 tables with well under 1 million rows in total, yet the import takes around 25-30 minutes to complete. On investigation, it seems most of the time is spent initialising Hive for each imported table. A plain mysqldump of the same database completes in under 1 minute. So the question is: how can this initialisation time be reduced, if that is indeed the cause, for example by using a single Hive session?
The import command is:
sqoop import-all-tables \
  -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --compress --compression-codec=snappy \
  --num-mappers 1 \
  --connect "jdbc:mysql://..." --username ... --password ... \
  --null-string '\\N' --null-non-string '\\N' \
  --hive-drop-import-delims --hive-import --hive-overwrite \
  --hive-database ... \
  --as-textfile --exclude-tables ... \
  --warehouse-dir=...
Update:
Sqoop version: 1.4.6.2.5.3.0-37
Hive version: 1.2.1000.2.5.3.0-37
Could be related to:
https://issues.apache.org/jira/browse/HIVE-10319
Upvotes: 1
Views: 3103
Reputation: 3849
Remove the --num-mappers 1 option to run the import with the default 4 mappers, or change it to a higher number such as --num-mappers 8 (if the hardware allows it). This runs the import with more parallel jobs for tables that have a primary key. Also add the --autoreset-to-one-mapper option, which falls back to 1 mapper for tables without a primary key, and use --direct mode:
sqoop import-all-tables \
--connect "jdbc:mysql://..." --username ... \
--password ... \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--compress --compression-codec=snappy \
--num-mappers 8 \
--autoreset-to-one-mapper \
--direct \
--null-string '\\N' \
...
Let us know if this improves the performance.
Update:
--fetch-size=<n>, where <n> is the number of entries that Sqoop fetches at a time (default: 1000). Increase the fetch-size value based on the volume of data that needs to be read, and set it according to the available memory and bandwidth.
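For example (a sketch with an illustrative value; tune <n> to your data volume and memory):

```shell
# Illustrative: fetch 10000 rows per round trip instead of the default 1000
sqoop import-all-tables \
  --connect "jdbc:mysql://..." --username ... --password ... \
  --fetch-size=10000 \
  <other sqoop options>
```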
Increase the mapper memory from its current value to a higher number, for example:
sqoop import-all-tables -D mapreduce.map.memory.mb=2048 -D mapreduce.map.java.opts=-Xmx1024m <sqoop options>
Sqoop Performance Tuning Best Practices: tune the following Sqoop arguments in the JDBC connection or the Sqoop mapping to optimize performance:
--autoreset-to-one-mapper (can't be used with import-all-tables)
Upvotes: 1
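If --autoreset-to-one-mapper is needed, one possible workaround (a sketch only, assuming you can list the table names yourself; the table names below are hypothetical) is to loop over the tables with plain sqoop import, where the option is supported:

```shell
# Sketch: import tables one by one so --autoreset-to-one-mapper applies.
# Replace the hypothetical table list with your actual tables.
for table in table1 table2 table3; do
  sqoop import \
    --connect "jdbc:mysql://..." --username ... --password ... \
    --table "$table" \
    --num-mappers 8 \
    --autoreset-to-one-mapper \
    --hive-import --hive-database ...
done
```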