Alg_D

Reputation: 2400

Hadoop - Create external table from multiple directories in HDFS

I have an external table in Hive that reads all the files from a single HDFS location (/user/hive/warehouse/tableX).

Now, let's assume the data is pre-partitioned and all the previous files are split into several directories with a specific naming convention <dir_name>_<incNumber>, e.g.

/user/hive/warehouse/split/
  ./dir_1/files...
  ./dir_2/files...
  ./dir_n/files...

How can I create another external table that keeps track of all the files in the split folder?

Do I need to create an external table that is partitioned on each sub-folder (dir_x)?

Also, is some kind of Hive or shell script needed to create/add a partition for each sub-directory?

Upvotes: 5

Views: 5082

Answers (2)

Srisatya

Reputation: 91

Once your external table is in place and you know the partition structure Hive expects, you can load the data files into the pre-partitioned folders through some process. Once the pre-partitioned files are available (following the partition structure Hive knows about), you can repair the table as below.

hive> MSCK REPAIR TABLE sample_table; 

The above statement will automatically sync all your existing partitions into the Hive metastore for the existing EXTERNAL TABLE "sample_table".
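One caveat worth noting: MSCK REPAIR TABLE only discovers directories that follow Hive's default `<partition_key>=<value>` naming convention. For the layout in the question, the split folders would need to look something like this (illustrative paths, assuming a partition column named `dir`):

```
/user/hive/warehouse/split/dir=dir_1/files...
/user/hive/warehouse/split/dir=dir_2/files...
```

Directories named plain `dir_1`, `dir_2`, etc. will not be picked up by MSCK REPAIR; those have to be registered explicitly with ALTER TABLE ... ADD PARTITION.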

Upvotes: 2

dheee

Reputation: 1658

You have to create an external table partitioned by dir_x to access all files in multiple folders.

CREATE external TABLE sample_table( col1 string,
                                    col2 string,
                                    col3 string,
                                    col4 string)
PARTITIONED BY (dir string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/split';

Then add the partitions as you would for a regular partitioned table:

ALTER TABLE sample_table ADD PARTITION(dir='dir_1')
LOCATION '/user/hive/warehouse/split/dir_1';
ALTER TABLE sample_table ADD PARTITION(dir='dir_2')
LOCATION '/user/hive/warehouse/split/dir_2';
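Adding partitions one by one gets tedious as the number of dir_N folders grows. A small shell helper (a hypothetical sketch, assuming the dir_N layout and table name above; on a real cluster you would list HDFS with `hadoop fs -ls` rather than a local directory) can generate the ALTER TABLE statements for you:

```shell
# Hypothetical helper: print one ADD PARTITION statement per
# sub-directory matching dir_* under the given split path.
# A local directory stands in for /user/hive/warehouse/split here.
gen_partition_ddl() {
  split_dir="$1"
  for d in "$split_dir"/dir_*; do
    [ -d "$d" ] || continue            # skip non-directories (and non-matches)
    name=$(basename "$d")
    echo "ALTER TABLE sample_table ADD IF NOT EXISTS PARTITION(dir='$name') LOCATION '$split_dir/$name';"
  done
}
```

The generated statements can then be piped into the Hive CLI, e.g. `gen_partition_ddl /user/hive/warehouse/split | hive -f /dev/stdin` (exact invocation depends on your setup). IF NOT EXISTS makes the script safe to re-run when new folders appear.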

This approach will work, but it has a drawback: if at some point in the future you add a new folder (e.g. dir_100) under the warehouse path, Hive will not pick it up automatically; you will have to register each new partition with another ALTER TABLE ... ADD PARTITION statement. I haven't worked with Hive for about 10 months now, so I am not sure whether there is a better approach. If this is not an issue for you, you can use this approach.

Upvotes: 5
