Reputation: 33
My Hive insert query is failing with the error below: java.lang.OutOfMemoryError: GC overhead limit exceeded
Data in table2 = 1.7 TB. Query:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set mapreduce.map.memory.mb=15000;
set mapreduce.map.java.opts=-Xmx9000m;
set mapreduce.reduce.memory.mb=15000;
set mapreduce.reduce.java.opts=-Xmx9000m;
set hive.rpc.query.plan=true;
insert into database1.table1 PARTITION(trans_date) select * from database1.table2;
Error info:
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. GC overhead limit exceeded
Cluster info: total memory: 1.2 TB, total vcores: 288, total nodes: 8, node version: 2.7.0-mapr-1808
Please note: I am trying to insert the data from table2, which is in Parquet format, into table1, which is in ORC format. The data size is 1.8 TB in total.
Upvotes: 2
Views: 2040
Reputation: 38335
Adding distribute by on the partition key should solve the problem:
insert into database1.table1 PARTITION(trans_date) select * from database1.table2
distribute by trans_date;
distribute by trans_date
will trigger a reducer step, and each reducer will process a single partition; this reduces memory pressure. When each process writes many partitions at once, it keeps too many ORC buffers open in memory.
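As a side note, Hive also has a sorted dynamic partition optimization that achieves a similar effect by sorting rows on the partition key before writing, so each task keeps only one ORC writer open at a time. A minimal sketch, assuming your Hive build exposes hive.optimize.sort.dynamic.partition:
set hive.optimize.sort.dynamic.partition=true; -- sort by the dynamic partition key so only one ORC writer is open per task
insert into database1.table1 PARTITION(trans_date) select * from database1.table2;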
Also consider adding this setting to control how much data each reducer will process:
set hive.exec.reducers.bytes.per.reducer=67108864; -- this is an example only; reduce the figure to increase parallelism
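For rough sizing (assuming the ~1.8 TB input mentioned in the question): at 64 MB per reducer this targets roughly 28,000 reducers, which in practice is capped by hive.exec.reducers.max, while a coarser value such as 256 MB targets roughly 7,000. For example:
set hive.exec.reducers.bytes.per.reducer=268435456; -- hypothetical 256 MB per reducer; ~1.8 TB / 256 MB ≈ 7,000 reducers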
Upvotes: 1