Reputation: 2543
I have a table test_details
with some 4 million records. Using the data in this table, I have to create a new partitioned table test_details_par
with records partitioned on visit_date
. Creating the table is not a challenge, but when I come to the part where I have to INSERT the data using Dynamic Partitions, Hive gives up when I try to insert data for more number of days. If I do it for 2 or 3 days the Map Reduce jobs runs successfully but for more days it fails giving a JAVA Heap Space Error
or GC Error
.
A Simplified Snapshot of my DDLs is as follows:
CREATE TABLE test_details_par( visit_id INT, visit_date DATE, store_id SMALLINT);
INSERT INTO TABLE test_details_par PARTITION(visit_date) SELECT visit_id, store_id, visit_date FROM test_details DISTRIBUTE BY visit_date;
I have tried setting these parameters, so that Hive executes my job in a better way:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode = 10000;
Is there anything that I am missing to run the INSERT for a complete batch without specifying the dates specifically?
Upvotes: 8
Views: 4667
Reputation: 3047
Neels,
Hive 12 and below have well-known scalability issues with dynamic partitioning that will be addressed with Hive 13. The problem is that Hive attempts to hold a file handle open for each and every partition it writes out, which causes out of memory and crashes. Hive 13 will sort by partition key so that it only needs to hold one file open at a time.
You have 3 options as I see
Upvotes: 6