Reputation: 726
I have an original table that is partitioned on YEAR, MONTH and DATE. For example:
col_1 col_2 col_3 YEAR MONTH DATE
a b c 2017 03 25
I want to create a new table which is a subset of this table, but still maintains the partitioning of the original table.
Something as simple as
CREATE new_table AS
SELECT *
FROM original_table
WHERE (conditions);
However, because the original table is so big, I have to iterate through this query via the partitions.
My current solution is to write a shell script that iterates through all partitions and runs a separate query for each partition.
Example:
for year in '2016' '2017'
do
for month in '01' '02' ...
do
for day in '01' '02' ...
do
hive -e "INSERT INTO new_table SELECT * FROM original_table WHERE YEAR=$year AND MONTH=$month etc."
done
done
done
But this appears very round-about and inefficient. Is there a way to do this as one line directly in hive?
Upvotes: 2
Views: 13355
Reputation: 191973
I recently had to do something like this for a subset of days, but it requires you to copy the schema of the original table, or at least use CREATE TABLE LIKE
But, most importantly, you're insert statement needs to specify the partitions
CREATE TABLE new_table (
fields...
)
PARTITIONED BY (year STRING, month STRING, day STRING);
INSERT OVERWRITE TABLE new_table PARTITION(year, month, day)
SELECT fields... , year, month, day -- partitions must be last
FROM original_table
WHERE
year BETWEEN '2016' AND '2017'; -- add more, as necessary
You might also be able to use a CTAS function, but doing those with partitioned tables isn't straightforward
Upvotes: 6