PasLeChoix

Reputation: 311

What's the difference between sqoop import to HDFS and to Hive?

I was able to use Sqoop to import a MySQL table "titles" into HDFS using a command like this:

sqoop import --connect jdbc:mysql://localhost/employees --username=root -P --table=titles --target-dir=titles --m=1

Now I want to import it into Hive. If I use the following command:

sqoop import --connect jdbc:mysql://localhost/employees --username=root -P --table titles --hive-import

I get the following message:

Output directory hdfs://localhost:9000/user/root/titles already exists

In Hive, if I run show tables I get the following:

hive> show tables;
OK
dept_emp
emp
myfirsthivetable
parted1emp
partitionedemp

As you can see, there is no table called titles in Hive.

I am confused by this. For Sqoop-imported data, is there a one-to-one relationship between HDFS and Hive? And what does that message mean?

Thank you for enlightening me.

Upvotes: 0

Views: 1019

Answers (2)

Jaime Caffarel

Reputation: 2469

As Amit has pointed out, your first command already created the HDFS directory, so Sqoop refuses to overwrite the titles folder because it already contains data.
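
You can verify this yourself; a quick check of the directory left by the first import might look like this (a sketch, using the path reported in the error message):

# list the HDFS directory created by the first import;
# the part-m-* files inside it are the data Sqoop wrote
hdfs dfs -ls /user/root/titles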

In your second command, you are telling Sqoop to import the whole table (which was already imported by the first command) into Hive. Since you are not specifying --target-dir with an HDFS destination, Sqoop tries to create the folder titles under /user/root/. Since this folder already exists, an error is raised.

When you tell Hive to show the tables, titles doesn't appear because the second command (with --hive-import) was not successful, so Hive doesn't know anything about the data. When you add the --hive-import flag, what Sqoop does under the hood is update the Hive metastore, which is a database holding the metadata of the Hive tables, their partitions, and their HDFS locations.
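
As a side note, once a table does exist in Hive you can ask the metastore where its data lives; a minimal sketch (using one of the tables from your show tables output) would be:

-- the Location field in the output is the HDFS directory
-- that the metastore has recorded for this table
DESCRIBE FORMATTED dept_emp;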

You could do the data import using just one Sqoop command instead of using two different ones. If you delete the titles HDFS folder and you perform something like this:

sqoop import --connect jdbc:mysql://localhost/employees --username=root -P \
  --table=titles --target-dir /user/root/titles --hive-import --m=1

This way, you pull the data from MySQL, create the /user/root/titles HDFS directory, and update the metastore, so that Hive knows where the table (and the data) is.
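
For reference, removing the existing folder (the precondition mentioned above) could be done with a plain HDFS command like this (a sketch; note that it permanently deletes the data from the first import):

# remove the directory left over from the first import
hdfs dfs -rm -r /user/root/titles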

But what if you don't want to delete the folder with the data you already imported? In that case, you could create a new Hive table titles and specify the location of the data with something like this:

CREATE [TEMPORARY] [EXTERNAL] TABLE titles
  [(col_name data_type [COMMENT col_comment], ...)]
  (...)
  LOCATION '/user/root/titles'

This way, you wouldn't need to re-import the data, since it's already in HDFS.
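
For illustration, a concrete version of that statement could look like the sketch below. It is only a sketch: the column names assume the standard MySQL employees sample schema, and the comma delimiter assumes Sqoop's default text output; adjust both to match your actual data.

-- external table over the data Sqoop already wrote to HDFS;
-- dropping it later removes only the metadata, not the files
CREATE EXTERNAL TABLE titles (
  emp_no    INT,
  title     STRING,
  from_date STRING,
  to_date   STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/root/titles';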

Upvotes: 2

Amit Yadav

Reputation: 54

When you create a table in Hive, it eventually creates a directory on HDFS. Since you already ran the plain HDFS import first, a directory named "titles" has already been created on HDFS.
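
For instance, with a stock configuration the managed tables shown by show tables live under the Hive warehouse directory; the path below is the usual default and may differ in your installation:

# each managed Hive table gets its own subdirectory here
hdfs dfs -ls /user/hive/warehouse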

You can either delete the /user/root/titles directory from HDFS and run the Hive import command again, or use the --hive-table option during the import.
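
For example, the --hive-table route could look like the sketch below. It is an illustration only: the staging --target-dir is arbitrary (it just has to differ from the existing /user/root/titles), and --hive-table simply names the Hive table to load the data into.

sqoop import --connect jdbc:mysql://localhost/employees --username=root -P \
  --table titles --hive-import --hive-table titles \
  --target-dir /user/root/titles_staging --num-mappers 1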

You can refer to the Sqoop documentation.

Hope this helps.

Upvotes: 0
