Ragzz

Reputation: 103

hive merge properties not working for small files

I am trying to insert data into a dynamically partitioned table, which is creating lots of small files. I have set the Hive properties below, but I still see small files in the partition folders; neither the size-per-task nor the average-file-size setting seems to be working for me, as the files in the partition folders are larger than the size per task I gave. Any help will be greatly appreciated.

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=10000;
set hive.merge.smallfiles.avgsize=100;

Upvotes: 3

Views: 4224

Answers (3)

Amit khandelwal

Reputation: 534

Working with small files in Hive is a common problem, and it can also be resolved by using CombineHiveInputFormat as the input format. Also use ORC files by default:

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

This will help the Hive job run faster when it has to read many small files.
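For example, the input-format setting and ORC storage can be combined as below (the database and table names here are hypothetical, for illustration only):

```sql
-- Let Hive combine many small input splits into fewer map tasks
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Store the table as ORC so the data is compressed and splittable
CREATE TABLE mydb.events_orc (id INT, payload STRING)
STORED AS ORC;
```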

Upvotes: 0

tonizz

Reputation: 116

This can happen when you execute multiple inserts into a single Hive table: each insert can result in one or more files under the HDFS location.

I have managed this situation by executing the command below; it compacts the table and merges all the files into one (or into bigger ones):

ALTER TABLE schema.table PARTITION (month = '01') CONCATENATE;

There is one restriction, though: you can't have indexes on your Hive tables when executing the merge command.

I have also tested it from Spark SQL (1.5.2) over ORC files, and it works fine.

Hope it helps
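As a sanity check, you can compare the file count and sizes in the partition directory before and after running CONCATENATE (the warehouse path below is hypothetical; substitute your table's actual HDFS location):

```shell
# Count directories, files, and total bytes under the partition
hdfs dfs -count /user/hive/warehouse/schema.db/table/month=01

# Or list the individual files with human-readable sizes
hdfs dfs -ls -h /user/hive/warehouse/schema.db/table/month=01
```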

Upvotes: 0

Jared

Reputation: 2954

Your example sets the average size to 100 bytes, which would allow a lot of small files and is most likely being ignored because the files are already larger than that. Try increasing this value to 128 MB (134217728 bytes), which should increase the average size of the files being merged after your job completes.

set hive.merge.smallfiles.avgsize = 134217728;
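Putting it together, a sketch of the full set of merge settings with the sizes spelled out in bytes (the 256 MB / 128 MB values are illustrative targets, not requirements; tune them to your workload):

```sql
-- Merge small output files of map-only and map-reduce jobs
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- On Tez, the corresponding flag is hive.merge.tezfiles

-- Target size of each merged file: 256 MB
SET hive.merge.size.per.task=268435456;
-- Run the merge job when the average output file is below 128 MB
SET hive.merge.smallfiles.avgsize=134217728;
```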

Upvotes: 2
