Reputation: 7605
There are (at least) two options to import a table from Oracle to Hive using Sqoop.
The create-hive-table tool populates the Hive metastore with a table definition based on a database table previously imported into HDFS, or one planned to be imported. In the following example we create the table emps:
$ sqoop create-hive-table \
--connect jdbc:mysql://localhost/dualcore \
--username training \
--password training \
--table employees \
--hive-table emps
Then you perform the import into the emps table:
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training \
--password training \
--m 1 \
--target-dir /queryresult \
--table employees \
--hive-table emps \
--hive-import
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_literal_sqoop_create_hive_table_literal
The second option is to use the --hive-import flag, which generates the Hive table definition from the source table's schema (if the table does not already exist) and loads the imported data into it in a single step. Example:
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training \
--password training \
--m 1 \
--target-dir /queryresult \
--table employees \
--hive-import
Which one of these two options is better?
Upvotes: 0
Views: 9589
Reputation: 7605
Even though the second option looks more efficient, there are cases where it cannot be used.
One case I had to face is when you need to create an external table in Hive. Since Sqoop does not support creating Hive external tables, a workaround was necessary.
What are external tables (they store data on HDFS)
Use an external table if you (a short example follows this list):
Are not planning to create the table from another table's schema, e.g. CREATE TABLE table1 AS (SELECT * FROM table2).
Need the data to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set, or if you are iterating through various possible schemas.
Want to use a custom location, such as ASV (Azure Storage).
Do not want Hive to own the data and control settings, directories, etc.; another program or process will handle those things.
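To make the difference concrete, here is a minimal sketch of an external table. The table name emps_ext, its columns, and the HDFS path are made up for illustration; adjust them to your own data:
# register only a table definition over files that already live at the given location
hive -e "CREATE EXTERNAL TABLE emps_ext (emp_id INT, emp_name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 'hdfs:///user/my_user/queryresult/';"
# dropping it removes only the metadata; the files under the LOCATION stay on HDFS
hive -e "DROP TABLE emps_ext;"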
What are internal tables (they store data on HDFS, but in a kind of restricted area)
Internal (managed) table files are secured solely via Hive. E.g. tbl_batting can only be accessed via the HDFS account that Hive runs as (standard HDFS permissions: r = read, w = write, x = execute).
Deleting the table deletes the metadata (from the metastore) and the data (from HDFS).
Use an internal table if you (a short example follows this list):
Want to store the data temporarily.
Want Hive to manage the lifecycle of the tables and data.
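For contrast, a minimal sketch of an internal (managed) table; again, the table name and columns are made up:
# Hive owns the files: they are created under the Hive warehouse directory
hive -e "CREATE TABLE emps_managed (emp_id INT, emp_name STRING);"
# dropping a managed table removes both the metadata and the data files from HDFS
hive -e "DROP TABLE emps_managed;"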
The workaround is to import a sample of the table from Oracle to create an Avro file, which is then used to generate the schema for the Hive table. After that you import the whole table from Oracle into HDFS.
1. Generate the Avro file
sqoop import --connect jdbc:oracle:thin:@//url:port/database \
--username my_user --password my_password -m 1 --fetch-size 10 \
--table table_to_import \
--columns columns_to_import \
--create-hive-table --hive-table dummy_table \
--as-avrodatafile --target-dir hdfs:///user/my_user/dummytab/
2. Generate the schema locally
avro-tools getschema hdfs://urlhadoop:port/user/my_user/dummytab/part-m-00000.avro > schematab.avsc
3. Move the schema back to HDFS
hdfs dfs -copyFromLocal schematab.avsc /user/my_user/schemas
4. Create the table in Hive
hive -e "CREATE EXTERNAL TABLE MyHiveTable
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION 'hdfs:///user/my_user/tables/'
  TBLPROPERTIES ('avro.schema.url'='hdfs:///user/my_user/schemas/schematab.avsc');"
5. Import all the data
sqoop import --connect jdbc:oracle:thin:@//url:port/database \
--username my_user --password my_password -m 1 \
--table table_to_import \
--columns columns_to_import \
--hive-table MyHiveTable \
--as-avrodatafile --target-dir hdfs:///user/my_user/tables/ --append
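Once the full import finishes, a quick sanity check along these lines (reusing the location and table name from the steps above) confirms that the Avro files landed where the external table expects them and that Hive can read them:
hdfs dfs -ls hdfs:///user/my_user/tables/
hive -e "SELECT COUNT(*) FROM MyHiveTable;"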
Upvotes: 1