Ignacio Alorre

Reputation: 7605

Sqoop - What is the best option to import a table from Oracle to Hive using the Avro format?

There are (at least) two options to import a table from Oracle to Hive using Sqoop.

First option

The create-hive-table tool populates the Hive metastore with a table definition based on a database table previously imported to HDFS, or one planned to be imported. In the following example we create the table emps:

$ sqoop create-hive-table \
--connect jdbc:mysql://localhost/dualcore \
--username training \
--password training \
--table employees \
--hive-table emps

Then you perform the import into the emps table:

 $ sqoop import \
 --connect jdbc:mysql://localhost/dualcore \
 --username training \
 --password training \
 --table employees \
 -m 1 \
 --target-dir /queryresult \
 --hive-table emps \
 --hive-import

https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_literal_sqoop_create_hive_table_literal

The second option (which looks more efficient)

With the --hive-import flag, Sqoop imports the table straight into Hive: it generates the Hive table definition from the source table's metadata (creating the table if it does not already exist) and then loads the data, all in one job.

Example:

 $ sqoop import \
 --connect jdbc:mysql://localhost/dualcore \
 --username training \
 --password training \
 -m 1 \
 --target-dir /queryresult \
 --table employees \
 --hive-import

Which one of these two options is better?

Upvotes: 0

Views: 9589

Answers (1)

Ignacio Alorre

Reputation: 7605

Even though the second option looks more efficient, there are cases where it cannot be used.

One case I had to face is when you need to create an external table in Hive. Since Sqoop does not support creating Hive external tables, a workaround was necessary.

Off Topic - Internal vs External Tables in Hive

External table in Hive

(stores data on HDFS)

What are external tables

  • An external table stores its files on HDFS, but the table is not completely tied to the source files.
  • If you delete an external table, the files still remain on HDFS.
  • The link between the table and the files is only metadata; the files themselves are left untouched.
  • As an example, if you create an external table called "my_test" in Hive using HiveQL and link it to the file "flat_file.txt", then deleting "my_test" from Hive will not delete "flat_file.txt" from HDFS (see the sketch after this list).
  • External table files are accessible to anyone who has access to the HDFS file structure, so security needs to be managed at the HDFS file/folder level.
  • Metadata is maintained on the master node, and deleting an external table from Hive only deletes the metadata, not the data/files.
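
A minimal HiveQL sketch of that behaviour, assuming "flat_file.txt" is a tab-delimited file already sitting in an HDFS directory (the table name, column and path are only illustrative):

-- Register metadata over an existing HDFS directory; Hive does not copy the data
CREATE EXTERNAL TABLE my_test (
  line STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/my_user/flat_files/';

-- Drops only the table definition from the metastore; flat_file.txt stays on HDFS
DROP TABLE my_test;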

Use external table if you:

  • Want to manage the data outside Hive, i.e. the data files are read and processed by an existing program that doesn't lock them (e.g. you plan to use an ETL tool to load/merge the data files).
  • Want to load the latest information into the table but still retain the old dataset in a file on HDFS for regulatory/legal purposes.
  • Are not planning to create a table from another table's schema, e.g. CREATE TABLE table1 AS SELECT * FROM table2 (shown after this list).
  • Need the data to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single dataset, or if you are iterating through various possible schemas.
  • Want to use a custom location such as ASV.
  • Do not want Hive to own the data and control settings, directories, etc.; another program or process will handle those things.
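
For reference, the "create table from another table" pattern mentioned above is a CTAS statement like the one below (table names are illustrative); classic Hive CTAS produces a managed table and cannot target an external one, which is why it appears in this list:

-- CTAS: the result is a managed (internal) table owned by Hive
CREATE TABLE table1 AS
SELECT * FROM table2;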

Internal table in Hive

(stores data on HDFS, but in a restricted area managed by Hive: the warehouse directory)

What are internal tables

  • Internal (managed) table file security is controlled solely via Hive; e.g. the files behind tbl_batting are only accessible through Hive's own HDFS account, with the usual permissions (r = read, w = write, x = execute).
  • Deleting the table deletes the metadata and the data from the master node and HDFS respectively (see the sketch after this list).
  • Security needs to be managed within Hive, probably at the schema level (this varies from organisation to organisation); HDFS security is out of scope in this case.
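
As a counterpart to the external sketch above (again with purely illustrative names and columns), a managed table keeps its files under Hive's warehouse directory, and dropping it removes both the metadata and the files:

-- Data ends up under Hive's warehouse directory, e.g. /user/hive/warehouse/tbl_batting
CREATE TABLE tbl_batting (
  player_id STRING,
  runs INT
);

-- Removes the metadata AND deletes the table's files from HDFS
DROP TABLE tbl_batting;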

Use internal table if you:

  • Want to store the data temporarily.
  • Want Hive to manage the lifecycle of the tables and the data.


SOLUTION (With a Workaround to create an External Table in Hive with Sqoop)

It is necessary to import a sample of the table from Oracle to create an Avro file, which is then used to generate the Hive table. After that you import the whole table from Oracle into HDFS.

1 To generate the Avro file

sqoop import --connect jdbc:oracle:thin:@//url:port/database \
  --username my_user --password my_password \
  -m 1 --fetch-size 10 \
  --table table_to_import \
  --columns columns_to_import \
  --create-hive-table --hive-table dummy_table \
  --as-avrodatafile --target-dir hdfs:///user/my_user/dummytab/

2 Generate the schema locally

avro-tools getschema hdfs://urlhadoop:port/user/my_user/dummytab/part-m-00000.avro > schematab.avsc

3 Move the schema back to HDFS

hdfs dfs -copyFromLocal schematab.avsc /user/my_user/schemas

4 Create the table in Hive

hive -e "CREATE EXTERNAL TABLE MyHiveTable
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION 'hdfs:///user/my_user/tables/'
  TBLPROPERTIES ('avro.schema.url'='hdfs:///user/my_user/schemas/schematab.avsc');"

5 Import All Data

sqoop import --connect jdbc:oracle:thin:@//url:port/database \
  --username my_user --password my_password -m 1 \
  --table table_to_import \
  --columns columns_to_import \
  --hive-table MyHiveTable \
  --as-avrodatafile --target-dir hdfs:///user/my_user/tables/ --append
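
As a quick sanity check, assuming the import finished without errors, you can query the external table straight away; the count should match the source table in Oracle:

hive -e "SELECT COUNT(*) FROM MyHiveTable;"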

Source: http://wpcertification.blogspot.com/2015/05/importing-data-from-sqoop-into-hive.html?_sm_au_=iVV10VW0vW1Z7Z42

Upvotes: 1
