Reputation: 313
I'm trying to import data directly from mysql to parquet but it doesn't seem to work correctly...
I'm using CDH5.3 which includes Sqoop 1.4.5.
Here is my command line :
sqoop import --connect jdbc:mysql://xx.xx.xx.xx/database --username username --password mypass --query 'SELECT page_id,user_id FROM pages_users WHERE $CONDITIONS' --split-by page_id --hive-import --hive-table default.pages_users3 --target-dir hive_pages_users --as-parquetfile
Then I get this error :
Warning: /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/01/09 14:31:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.0
15/01/09 14:31:49 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/01/09 14:31:49 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
15/01/09 14:31:49 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
15/01/09 14:31:49 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/01/09 14:31:49 INFO tool.CodeGenTool: Beginning code generation
15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:50 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
Note: /tmp/sqoop-root/compile/b90e7b492f5b66554f2cca3f88ef7a61/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/01/09 14:31:51 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/b90e7b492f5b66554f2cca3f88ef7a61/QueryResult.jar
15/01/09 14:31:51 INFO mapreduce.ImportJobBase: Beginning query import.
15/01/09 14:31:51 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/01/09 14:31:51 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:51 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:51 WARN spi.Registration: Not loading URI patterns in org.kitesdk.data.spi.hive.Loader
15/01/09 14:31:51 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive?dataset=default.pages_users3
org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive?dataset=default.pages_users3
at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:109)
at org.kitesdk.data.Datasets.create(Datasets.java:189)
at org.kitesdk.data.Datasets.create(Datasets.java:240)
at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:81)
at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:70)
at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:112)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:262)
at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:721)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:499)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
I have no problem importing data to hive file format but parquet is a problem... Do you have any idea why this occurs ?
Thank you :)
Upvotes: 2
Views: 33656
Reputation: 67
(ps. for Oracle users: I had to connect as owner of the source table, otherwise had to specify "MRT_OWNER.TIME_DIM", and was getting error org.kitesdk.data.ValidationException: Namespace MRT_OWNER.TIME_DIM is not alphanumeric (plus '_'), seems a sqoop bug).
This can be fixed if database name and table name is written as db_name/table_name instead of db_name.table_name.
Upvotes: 1
Reputation: 14891
Here's my pipeline in CDH 5.5 to import from a jdbc into Hive parquet files. JDBC data source is for Oracle, but explanation below fits MySQL too.
1) Sqoop:
$ sqoop import --connect "jdbc:oracle:thin:@(complete TNS descriptor)" \
--username MRT_OWNER -P \
--compress --compression-codec snappy \
--as-parquetfile \
--table TIME_DIM \
--warehouse-dir /user/hive/warehouse \
--num-mappers 1
I chose --num-mappers as 1 because TIME_DIM table had just around ~20k rows, and it's not advised to split parquet table into multiple files for such a small dataset. Each mapper creates a separate output (parquet) file.
(ps. for Oracle users: I had to connect as owner of the source table, otherwise had to specify "MRT_OWNER.TIME_DIM", and was getting error org.kitesdk.data.ValidationException: Namespace MRT_OWNER.TIME_DIM is not alphanumeric (plus '_'), seems a sqoop bug).
(ps2. Table name had to be all-uppercase.. not sure if this is Oracle specific (shouldn't be) and if this is another sqoop bug).
(ps3. --compress --compression-codec snappy parameters were recognized but did not seem made any effect)
2) Above command creates a directory named
/user/hive/warehouse/TIME_DIM
It's a wise idea to move it to a specific Hive database directory, e.g.:
$ hadoop fs -mv /hivewarehouse/TIME_DIM /hivewarehouse/dwh.db/time_dim
Assuming name of Hive database/schema is "dwh".
3) Create Hive table, by taking schema directly from parquet file:
$ hadoop fs -ls /user/hive/warehouse/dwh.db/time_dim | grep parquet
-rwxrwx--x+ 3 hive hive 1216 2016-02-04 23:56 /user/hive/warehouse/dwh.db/time_dim/62679a1c-b848-426a-bb8e-9372328ddad7.parquet
If above command returns more than parquet file (it means you had more than one mapper, the --num-mappers parameter), you can pick any parquet file into the below command.
This command should run in Impala and not in Hive. Hive currently can't infer schema from parquet files, but Impala can:
[impala-shell] > CREATE TABLE dwh.time_dim
LIKE PARQUET '/user/hive/warehouse/dwh.db/time_dim/62679a1c-b848-426a-bb8e-9372328ddad7.parquet'
COMMENT 'sqooped from MRT_OWNER.TIME_DIM'
STORED AS PARQUET
LOCATION 'hdfs:///user/hive/warehouse/dwh.db/time_dim'
;
ps. It's also possible to infer schema from parquet using Spark, e.g.
spark.read.schema('hdfs:///user/hive/warehouse/dwh.db/time_dim')
4) Since table wasn't created in Hive (which collects stats automatically), it's a good idea to collect stats:
[impala-shell] > compute stats dwh.time_dim;
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_literal_sqoop_import_literal
Upvotes: 5
Reputation: 101
Please do not use <db>.<table>
with --hive-table
. This doesn't work well with Parquet import. Sqoop uses Kite SDK to write Parquet files and it doesn't like this <db>.<table>
format.
Instead, please use --hive-database --hive-table . for your command, it should be:
sqoop import --connect jdbc:mysql://xx.xx.xx.xx/database \
--username username --password mypass \
--query 'SELECT page_id,user_id FROM pages_users WHERE $CONDITIONS' --split-by page_id \
--hive-import --hive-database default --hive-table pages_users3 \
--target-dir hive_pages_users --as-parquetfile
Upvotes: 10
Reputation: 313
I found a solution, I droppped all the hive parts and use the target dir to store the data... Seems to work :
sqoop import --connect jdbc:mysql://xx.xx.xx.xx/database --username username --password mypass --query 'SELECT page_id,user_id FROM pages_users WHERE $CONDITIONS' --split-by page_id --target-dir /home/cloudera/user/hive/warehouse/soprism.db/pages_users3 --as-parquetfile -m 1
I then link to the directory making an external table from Impala...
Upvotes: 0
Reputation: 891
Seems like database support is missing in your distribution. It looks like it was added rather recently. Try setting --hive-table
to --hive-table pages_users3
and removing --target-dir
.
If the above doesn't work, try:
Upvotes: 0