Reputation: 739
I have written an INSERT OVERWRITE ... PARTITION statement in Hive to merge all the files in a partition into larger files.
SQL:
SET hive.exec.compress.output=true;
set hive.merge.smallfiles.avgsize=2560000000;
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
SET mapreduce.max.split.size=256000000;
SET mapreduce.min.split.size=256000000;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET hive.hadoop.supports.splittable.combineinputformat=true;
SET mapreduce.output.fileoutputformat.compress.codec=${v_compression_codec};
INSERT OVERWRITE TABLE ${source_database}.${table_name} PARTITION (${line})
SELECT ${prepare_sel_columns}
FROM ${source_database}.${table_name}
WHERE ${partition_where_clause};
With the above settings I get compressed output, but generating the output files takes too long.
Even though it runs map-only jobs, it still takes a long time.
I am looking for any further Hive settings to tune the INSERT so it runs faster.
Metrics:
15 GB of files takes 10 min.
Upvotes: 2
Views: 2263
Reputation: 739
SET hive.exec.compress.output=true;
SET mapreduce.input.fileinputformat.split.minsize=512000000;
SET mapreduce.input.fileinputformat.split.maxsize=5120000000;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET hive.hadoop.supports.splittable.combineinputformat=true;
SET mapreduce.output.fileoutputformat.compress.codec=${v_compression_codec};
The above settings helped a lot. The duration came down from 10 min to 1 min.
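To confirm that the split-size values are actually in effect for the session, they can be echoed back before running the INSERT. This is only a minimal sketch: in Hive, SET with a property name and no value prints the current value of that property. The two values are the ones from the settings above; nothing else is assumed.

```sql
-- Apply the split sizes used above.
SET mapreduce.input.fileinputformat.split.minsize=512000000;
SET mapreduce.input.fileinputformat.split.maxsize=5120000000;

-- SET with no value prints the current setting, so these two lines
-- show what the session will use for the INSERT OVERWRITE.
SET mapreduce.input.fileinputformat.split.minsize;
SET mapreduce.input.fileinputformat.split.maxsize;
```

If a printed value differs from what was set, the property is probably being overridden elsewhere in the script or session.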
Upvotes: 1