Reputation: 1469
I'm writing a custom InputFormat (specifically, a subclass of org.apache.hadoop.mapred.FileInputFormat), OutputFormat, and SerDe for use with binary files to be read in through Apache Hive. Not all records within the binary files have the same size.
I'm finding that Hive's default InputFormat, CombineHiveInputFormat, does not delegate getSplits to my custom InputFormat's implementation, so all input files get split on regular 128 MB boundaries. Since such a boundary can fall in the middle of a record, every split after the first is very likely to appear to contain corrupt data.
I've already found a few workarounds, but I'm not pleased with any of them.
One workaround is to do:
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
With HiveInputFormat instead of CombineHiveInputFormat, the call to getSplits is correctly delegated to my InputFormat and all is well. However, I want to make my InputFormat, OutputFormat, etc. easily available to other users, so I'd prefer they not have to set this property themselves. Additionally, I'd like to be able to take advantage of split combining where possible.
Another workaround is to create a StorageHandler. However, I'd prefer not to do this, since every table backed by a StorageHandler is non-native: all reducers write out to a single file, LOAD DATA into the table is disallowed, and other niceties of native tables are lost.
Finally, I could have my InputFormat implement CombineHiveInputFormat.AvoidSplitCombination to bypass most of CombineHiveInputFormat, but that interface is only available starting in Hive 1.0, and I'd like my code to work with earlier versions of Hive (at least back to 0.12).
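For Hive 1.0 and later, that last workaround can be sketched roughly as below. This is a hypothetical format, not code from the question: the class name and key/value types are placeholders, and the actual record reader is elided.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical custom format; key/value types are placeholders.
public class MyBinaryInputFormat extends FileInputFormat<NullWritable, BytesWritable>
        implements CombineHiveInputFormat.AvoidSplitCombination {

    // CombineHiveInputFormat consults this hook (Hive 1.0+); returning true
    // excludes these paths from combining, so this format's own getSplits
    // is used instead of the 128 MB default boundaries.
    @Override
    public boolean shouldSkipCombine(Path path, Configuration conf) throws IOException {
        return true;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        throw new UnsupportedOperationException("record reader elided in this sketch");
    }
}
```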
I filed a ticket in the Hive bug tracker in case this behavior is unintentional: https://issues.apache.org/jira/browse/HIVE-9771
Has anyone written a custom FileInputFormat that overrides getSplits for use with Hive? Did you run into any trouble getting Hive to delegate the call to getSplits, and how did you overcome it?
Upvotes: 1
Views: 806
Reputation: 2725
Typically in this situation you leave the splits alone so that you get data locality for the blocks, and have your RecordReader understand how to start reading from the first record in the block (split) and to read into the next block when the final record does not end exactly at the end of the split. This requires some remote reads, but that is normal and usually very minimal.
TextInputFormat/LineRecordReader does this: it uses newline to delimit records, so naturally a record can span two blocks. Rather than starting at the first byte of the split, it advances to the first complete record in the split, and on the last record it reads into the next block if necessary to get the complete data.
LineRecordReader starts its split by seeking past the initial partial record, and ends its split by reading past the end of the current block to finish the final record.
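That boundary rule can be illustrated without Hadoop at all. The sketch below is a toy simulation over an in-memory byte array of newline-delimited records, not LineRecordReader's actual code: each split skips the partial record at its start (the previous split's reader owns it) and reads past its own end to finish the last record.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitBoundaryDemo {

    // Return the records that "belong" to the byte range [splitStart, splitEnd)
    // of data, where records are newline-delimited.
    static List<String> readSplit(byte[] data, int splitStart, int splitEnd) {
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        if (splitStart != 0) {
            // Back up one byte and scan to the next newline, as Hadoop's
            // LineRecordReader does: this discards a leading partial record,
            // but keeps a record that starts exactly at splitStart.
            pos = splitStart - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // first complete record in this split starts here
        }
        // Read every record that STARTS before splitEnd; the final record may
        // extend past splitEnd into the next block (a small remote read).
        while (pos < data.length && pos < splitEnd) {
            int end = pos;
            while (end < data.length && data[end] != '\n') {
                end++;
            }
            records.add(new String(data, pos, end - pos, StandardCharsets.UTF_8));
            pos = end + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // Split at byte 8, which lands in the middle of "bravo".
        System.out.println(readSplit(data, 0, 8));           // prints [alpha, bravo]
        System.out.println(readSplit(data, 8, data.length)); // prints [charlie, delta]
    }
}
```

Note that every record is read by exactly one split even though the split boundary falls mid-record, which is the property the default 128 MB boundaries break for the asker's format.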
Hope that helps direct the design of your custom code.
Upvotes: 1