Mike

Reputation: 7193

Hive output larger than dfs blocksize limit

I have a table, test, which was created in Hive. It is partitioned by idate, and existing partitions often need additional rows inserted into them. This can leave files on HDFS that contain only a few rows.

hadoop fs -ls /db/test/idate=1989-04-01
Found 3 items
-rwxrwxrwx   3 deployer   supergroup        710 2015-04-26 11:33 /db/test/idate=1989-04-01/000000_0
-rwxrwxrwx   3 deployer   supergroup        710 2015-04-26 11:33 /db/test/idate=1989-04-01/000001_0
-rwxrwxrwx   3 deployer   supergroup        710 2015-04-26 11:33 /db/test/idate=1989-04-01/000002_0

I am trying to put together a simple script to combine these files, to avoid having many small files on my partitions:

insert overwrite table test partition (idate)
select * from test
where idate = '1989-04-01'
distribute by idate
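
For reference, dynamic partitioning is already enabled in my session, along these lines (standard Hive properties, shown only for completeness; your values may differ):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;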

This works: it creates a new file containing all the rows from the old ones. The problem is that when I run this script on larger partitions, the output is still a single file:

hadoop fs -ls /db/test/idate=2015-04-25
Found 1 items
-rwxrwxrwx   3 deployer   supergroup 1400739967 2015-04-27 10:53 /db/test/idate=2015-04-25/000001_0

This file is over 1 GB in size, but the block size is set to 128 MB:

hive> set dfs.blocksize;
dfs.blocksize=134217728

I could manually set the number of reducers to keep the output files small, but shouldn't this be split up automatically? Why is Hive creating files larger than the allowed block size?
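
(By "manually set the number of reducers" I mean something like the following, which I would rather not hardcode:)

-- hypothetical reducer count, just to force more output files
set mapred.reduce.tasks=8;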


NOTE: These are compressed RCFiles, so I can't simply cat them together.

Upvotes: 3

Views: 2978

Answers (2)

Ashrith

Reputation: 6855

It's alright to have a large file in a splittable format, since downstream jobs can split that file based on the block size. Generally you get one output file per reducer; to get more reducers, define bucketing on your table and tune the number of buckets to get files of the size you want. For the bucket column, pick a high-cardinality column that you are likely to join on.
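
As a rough sketch (the column names and bucket count here are hypothetical, since I don't know your schema), a bucketed version of the table might look like:

-- hypothetical columns; user_id stands in for a high-cardinality join column
create table test_bucketed (
  user_id bigint,
  payload string
)
partitioned by (idate string)
clustered by (user_id) into 16 buckets
stored as rcfile;

-- make inserts respect the declared bucket count
set hive.enforce.bucketing=true;

insert overwrite table test_bucketed partition (idate)
select user_id, payload, idate
from test
where idate = '2015-04-25';

With bucketing enforced, each insert produces roughly one file per bucket per partition, so the bucket count effectively controls how many files a partition gets and, indirectly, how large each one is.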

Upvotes: 1

Mike

Reputation: 7193

Alright, I see the error in my thinking. My mistake was assuming that the files listed by HDFS were the actual blocks. That is not the case: the 1 GB file is broken up into 128 MB blocks under the hood, so there is nothing wrong with having a single file per partition. Mappers can still parallelize when reading over the underlying blocks.
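
One way to confirm this is to ask HDFS for the block breakdown of the file (fsck here is only used for inspection):

# list the individual blocks backing the 1.4 GB file;
# with a 128 MB block size this should report around 11 blocks
hadoop fsck /db/test/idate=2015-04-25/000001_0 -files -blocks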

Upvotes: 0
