Reputation: 2400
I have an external table that reads data from the HDFS location (/user/hive/warehouse/tableX) all files and created a external table in Hive.
Now, let's assume there's some pre-partitioning of the data and all the previous files are spitted in several directories with a specific name convention <dir_name>_<incNumber> e.g.
/user/hive/warehouse/split/
./dir_1/files...
./dir_2/files...
./dir_n/files...
how can I create another external table that keeps track of all files in the split folder?
Do I need to create an external table that is partitioned on each sub-folder (dir_x)?
Also for that, is it needed some kind of Hive or shell script that can create/add a partition for each sub-directory?
Upvotes: 5
Views: 5082
Reputation: 91
Once you have your external table in place and you know the partition structure that hive would create. You can load the data files (into pre-partitioned folders) in your drive through some process. Once the pre-partitioned file is available (as per hive known partition structure), you can repair the table as below.
hive> MSCK REPAIR TABLE sample_table;
The above statement will automatically sync up all your existing partitions to the hive meta store of the existing EXTERNAL TABLE "sample_table".
Upvotes: 2
Reputation: 1658
You have to create an external table partitioned by dir_x to access all files in multiple folders.
CREATE external TABLE sample_table( col1 string,
col2 string,
col3 string,
col4 string)
PARTITIONED BY (dir string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/split';
Then add the partition as you would to a regular partitioned table
ALTER TABLE sample_table ADD PARTITION(dir='dir_1')
LOCATION '/user/hive/warehouse/split/dir_1';
ALTER TABLE sample_table ADD PARTITION(dir='dir_2')
LOCATION '/user/hive/warehouse/split/dir_2';
This approach will work. There is an issue with this approach. If some time in the future you decide to add a new folder(e.g. dir_100) to the hive warehouse path, you will have to drop and recreate sample_table and re add all the partitions to sample_table again using ALTER TABLE statement. I haven't worked with hive for about 10 months now, so i am not sure if there is a better approach. If this is not an issue, you can use this approach.
Upvotes: 5