kumar saurabh

Reputation: 33

HIVE: Insert query fails with error "java.lang.OutOfMemoryError: GC overhead limit exceeded"

My Hive insert query fails with the following error: java.lang.OutOfMemoryError: GC overhead limit exceeded

Data in table2 = 1.7 TB

Query:

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set mapreduce.map.memory.mb=15000;
set mapreduce.map.java.opts=-Xmx9000m;
set mapreduce.reduce.memory.mb=15000;
set mapreduce.reduce.java.opts=-Xmx9000m;
set hive.rpc.query.plan=true;
insert into database1.table1 PARTITION(trans_date) select * from database1.table2;

Error info:

Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. GC overhead limit exceeded

Cluster info: total memory: 1.2 TB, total vcores: 288, total nodes: 8, node version: 2.7.0-mapr-1808

Please note: I am trying to insert the data from table2, which is in Parquet format, into table1, which is in ORC format. The data size is 1.8 TB in total.

Upvotes: 2

Views: 2040

Answers (1)

leftjoin

Reputation: 38335

Adding distribute by on the partition key should solve the problem:

insert into database1.table1 PARTITION(trans_date) select * from database1.table2
distribute by trans_date;

distribute by trans_date will trigger a reducer step, and each reducer will process a single partition, which reduces memory pressure. When each process writes many partitions at once, it keeps too many ORC buffers in memory.
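Put together with the dynamic-partition settings from the question, the full statement could look like the sketch below (the memory settings are the asker's own values carried over, not recommendations):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set mapreduce.reduce.memory.mb=15000;
set mapreduce.reduce.java.opts=-Xmx9000m;

insert into database1.table1 PARTITION(trans_date)
select * from database1.table2
distribute by trans_date;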

Also consider adding this setting to control how much data each reducer will process:

set hive.exec.reducers.bytes.per.reducer=67108864; --this is example only, reduce the figure to increase parallelism
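For a rough sense of scale (an estimate only, assuming the ~1.7 TB input size mentioned in the question):

-- 1.7 TB ≈ 1,780,000 MB; at 64 MB per reducer that requests roughly 27,000 reducers
-- the actual count is also capped by hive.exec.reducers.max (1009 by default in recent Hive versions)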

Upvotes: 1
