Interfector

Reputation: 1898

Hive dynamic partitions generate multiple files

I have a number of Hive jobs that run throughout the day. The jobs output data to Amazon S3 and use dynamic partitioning.

The problem is that when different jobs need to write to the same dynamic partition, each of them generates its own file.

What I would like is for the subsequent jobs to load the existing data and merge it with the new data.

I should mention that the query that actually outputs to S3 is an INSERT INTO TABLE query.
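
For reference, a minimal sketch of what such a load might look like (the table and column names here are hypothetical; the set statements are the standard Hive dynamic-partition switches):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- events is a hypothetical external table whose LOCATION points at S3;
-- each run adds new files under the matching dt= partition directories
insert into table events partition (dt)
select user_id, payload, dt
from staging_events;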

Upvotes: 1

Views: 988

Answers (2)

Joe K

Reputation: 18424

Without rewriting all of the data every time, this certainly isn't possible in Hadoop 1.x, and would be very difficult in 2.0.

Fundamentally, Hadoop 1.x does not support file appends. If a new process comes along and wants to write to a directory, it must create new files; it's impossible to append to already-existing ones.

Even if appends were possible (as they are in 2.0), there would be many race conditions and other things for Hive to worry about. It's a very difficult problem.

However, this is a common issue. The typical solution is to let your process add the new files, and periodically run a "compaction" job that just does something like:

insert overwrite table my_table partition (foo='bar')
select * from my_table where foo = 'bar'
distribute by foo; -- foo is constant within this partition, so every row goes to a single reducer (and one output file)

This should force just one file to be created. However, again you should worry about race conditions. Either make sure you have locking enabled, or only compact partitions that you are sure are not being written to.
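
If you go the locking route, this is roughly what the standard Hive/ZooKeeper concurrency setup looks like (the quorum host below is a placeholder, assuming you have a ZooKeeper ensemble available):

set hive.support.concurrency=true;
set hive.lock.manager=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager;
-- placeholder host; point this at your own ZooKeeper ensemble
set hive.zookeeper.quorum=zk1.example.com;

With these set, the compaction's insert overwrite takes an exclusive lock on the partition, so concurrent writers wait rather than race.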

Upvotes: 1

pensz

Reputation: 1881

I think you can try INSERT OVERWRITE TABLE, for example:
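
A sketch of how that could fold the existing rows and the new batch into one overwrite (table and column names are hypothetical, and older Hive versions want the union all wrapped in a subquery):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- rewrite each touched partition as its existing rows plus the new batch
insert overwrite table events partition (dt)
select user_id, payload, dt
from (
  select user_id, payload, dt from events
  union all
  select user_id, payload, dt from new_batch
) merged;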

Upvotes: 0
