phenderbender

Reputation: 785

Best option for merging multiple files within the same partition in hadoop?

I have a table partitioned on event_date, and for some reason, when I inserted data into my external table, some dates have only one or two files, while some have over 200.

I always use this snippet of code when kicking off the Hive queries that insert the data, so I'm not sure where or how it went out of whack for some dates but not others. I thought the hive.merge.tezfiles line specifically was what handled file merging on insert.

SET mapred.job.queue.name=my_directory;
use this_directory;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.dynamic.partitions.pernode=2000;
SET hive.merge.tezfiles=true;

Everything I've found online mentions having to copy the files down locally and then upload them again.

Is there a way to merge the multiple files within each date partition in a clean and simple way?

I have tried the following on a couple of dates that had 4 and 15 files, respectively. The Hive output after it ran confirmed that the extraneous files had been deleted, but when I went back and looked in Hadoop, there were just as many files as when I started. Luckily the data was still accurate when I checked it, so I'm not sure what was actually deleted. Is this not the right command to be using at all?

alter table table_being_edited PARTITION(event_dt='2017-01-01') CONCATENATE;  

Here is one such line where it confirmed the extra files had been removed:

Moved: 'my_hdfs_filepath/event_dt=2019-10-24/000052_0' to trash at: my_trash_directory/.Trash/Current

OK
Time taken: 75.321 seconds

For the date that had 15 files, it gave me a similar output 15x.
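
In case it helps, this is how I'm counting the files per partition, straight from the Hive CLI (same placeholder path as in my output above):

-- Lists the part files in one date partition so before/after counts can be
-- compared; dfs commands in the Hive CLI pass through to the Hadoop fs shell.
dfs -ls my_hdfs_filepath/event_dt=2019-10-24;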

I'm hoping to narrow the dates that have many files down to just one or two each, if at all possible, as we are running out of namespace. I'm very new to all of this, so is there any barebones, simple way of merging files within a single date partition?

Upvotes: 3

Views: 4521

Answers (4)

der_grund

Reputation: 1938

I've had success with adding

DISTRIBUTE BY <partition_column>

to the INSERT statement.
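
For example, a sketch with hypothetical table and column names:

-- DISTRIBUTE BY routes all rows with the same event_dt to the same reducer,
-- so each partition is written as a single file (all names are placeholders).
INSERT OVERWRITE TABLE target_table PARTITION (event_dt)
SELECT col_a, col_b, event_dt
FROM source_table
DISTRIBUTE BY event_dt;

The trade-off is that every partition is funneled through a single reducer, so a very large partition can become a bottleneck.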

Upvotes: 0

Hazhir

Reputation: 51

If your HDFS/MapR-FS block size is 256 MB, it is better to set hive.merge.smallfiles.avgsize to 256 MB as well:

SET hive.merge.tezfiles=true; --Merge small files at the end of a Tez DAG.
SET hive.merge.mapfiles=true; --Hive will start an additional map-reduce job to merge the output files into bigger files
SET hive.merge.mapredfiles=true; --Hive will start an additional map-reduce job to merge the output files into bigger files
SET hive.merge.orcfile.stripe.level=true; --When hive.merge.mapfiles, hive.merge.mapredfiles or hive.merge.tezfiles is enabled while writing a table with ORC file format, enabling this configuration property will do stripe-level fast merge for small ORC files.
SET hive.merge.size.per.task=256000000; --Size of merged files at the end of the job.
SET hive.merge.smallfiles.avgsize=256000000; --When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true.
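
With those settings in place, an ordinary insert should trigger the merge pass on its own; a minimal sketch with placeholder table names:

-- Assumes the dynamic-partition settings from the question are also set;
-- the partition column must be the last column selected.
INSERT OVERWRITE TABLE events_merged PARTITION (event_dt)
SELECT col_a, col_b, event_dt
FROM events_staging;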

Upvotes: 1

phenderbender

Reputation: 785

By adding this line in addition to my other SET Hive parameters, I was able to consistently merge part files into a single file of 5 gigabytes or less when inserting them into a new table:

set hive.merge.smallfiles.avgsize=5000000000;
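
Put together, the insert into the new table looked roughly like this (the new table name is a placeholder; table_being_edited is the table from my question):

-- Dynamic-partition settings from my original snippet are assumed to be set.
-- With smallfiles.avgsize at 5 GB, each partition's part files are merged
-- up to roughly that size on the way in.
SET hive.merge.tezfiles=true;
SET hive.merge.smallfiles.avgsize=5000000000;

INSERT OVERWRITE TABLE table_being_edited_merged PARTITION (event_dt)
SELECT * FROM table_being_edited;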

It is also possible to use getmerge and then put the files back, but that requires the extra step of pulling the files down locally (you must have plenty of local storage, depending on the size of your files), which was more cumbersome than creating a new table and inserting with this additional SET parameter.
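
For reference, the getmerge route looks something like this (paths are placeholders, and plain concatenation is only safe for text-format files, so don't do this with ORC or Parquet data):

-- Run from the Hive CLI; dfs passes through to the Hadoop fs shell.
-- 1. Concatenate every part file in the partition into one local file.
dfs -getmerge my_hdfs_filepath/event_dt=2017-01-01 /tmp/event_dt_2017-01-01_merged;
-- 2. Remove the original part files (the partition is briefly empty here).
dfs -rm my_hdfs_filepath/event_dt=2017-01-01/*;
-- 3. Upload the merged file back into the partition directory.
dfs -put /tmp/event_dt_2017-01-01_merged my_hdfs_filepath/event_dt=2017-01-01/;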

The other alternative was using

set hive.merge.mapfiles=true;

This seems to relate to the number of mappers: with lots of small files, Hadoop has to create one mapper per file, which is not optimal for its design, so the Tez merge option is more suitable.

Upvotes: 0

Naga

Reputation: 426

You can try setting the properties below:


SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=134217728; -- 128 MB


Upvotes: 0
