Mike

Reputation: 7193

Hive output larger than dfs blocksize limit

I have a table, test, which was created in Hive. It is partitioned by idate, and existing partitions often need additional rows inserted into them. This can leave files on HDFS that contain only a few rows.

hadoop fs -ls /db/test/idate=1989-04-01
Found 3 items
-rwxrwxrwx   3 deployer   supergroup        710 2015-04-26 11:33 /db/test/idate=1989-04-01/000000_0
-rwxrwxrwx   3 deployer   supergroup        710 2015-04-26 11:33 /db/test/idate=1989-04-01/000001_0
-rwxrwxrwx   3 deployer   supergroup        710 2015-04-26 11:33 /db/test/idate=1989-04-01/000002_0

I am trying to put together a simple script to combine these files, to avoid having many small files on my partitions:

insert overwrite table test partition (idate)
select * from test
where idate = '1989-04-01'
distribute by idate
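
For reference, dynamic partitioning is already enabled in my session, along these lines (standard Hive properties, shown only for completeness; your values may differ):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;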

This works: it creates a new file containing all the rows from the old ones. The problem is that when I run this script on larger partitions, the output is still a single file:

hadoop fs -ls /db/test/idate=2015-04-25
Found 1 items
-rwxrwxrwx   3 deployer   supergroup 1400739967 2015-04-27 10:53 /db/test/idate=2015-04-25/000001_0

This file is over 1 GB in size, but the block size is set to 128 MB:

hive> set dfs.blocksize;
dfs.blocksize=134217728

I could manually set the number of reducers to keep the output files small, but shouldn't this be split up automatically? Why is Hive creating files larger than the allowed block size?
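
(By "manually set the number of reducers" I mean something like the following, which I would rather not hardcode:)

-- hypothetical reducer count, just to force more output files
set mapred.reduce.tasks=8;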


NOTE: These are compressed RCFiles, so I can't simply cat them together.

Upvotes: 3

Views: 2978

Answers (2)

Ashrith

Reputation: 6855

It's alright to have a large file in a splittable format, since downstream jobs can split that file based on the block size. Generally you get one output file per reducer; to get more reducers, define bucketing on your table and tune the number of buckets to get files of the size you want. For the bucket column, pick a high-cardinality column that you are likely to join on.
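
As a rough sketch (the column names and bucket count here are hypothetical, since I don't know your schema), a bucketed version of the table might look like:

-- hypothetical columns; user_id stands in for a high-cardinality join column
create table test_bucketed (
  user_id bigint,
  payload string
)
partitioned by (idate string)
clustered by (user_id) into 16 buckets
stored as rcfile;

-- make inserts respect the declared bucket count
set hive.enforce.bucketing=true;

insert overwrite table test_bucketed partition (idate)
select user_id, payload, idate
from test
where idate = '2015-04-25';

With bucketing enforced, each insert produces roughly one file per bucket per partition, so the bucket count effectively controls how many files a partition gets and, indirectly, how large each one is.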

Upvotes: 1

Mike

Reputation: 7193

Alright, I see the error in my thinking. My mistake was assuming that the files listed by HDFS were the actual blocks. That is not the case: the 1 GB file is broken up into 128 MB blocks under the hood, so there is nothing wrong with having a single file per partition. Mappers can still parallelize when reading over the underlying blocks.
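
One way to confirm this is to ask HDFS for the block breakdown of the file (fsck here is only used for inspection):

# list the individual blocks backing the 1.4 GB file;
# with a 128 MB block size this should report around 11 blocks
hadoop fsck /db/test/idate=2015-04-25/000001_0 -files -blocks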

Upvotes: 0
