Reputation: 4487
Update: Alright, it turns out the reason the below isn't working is that I'm using the newer version of the InputFormat API (import org.apache.hadoop.mapreduce, which is the new one, versus import org.apache.hadoop.mapred, which is the old one), and Hive expects the old one. The problem I have now is porting my existing code to the old API. Has anyone had experience writing a multi-line InputFormat using the old API?
I'm trying to process Omniture's data log files with Hadoop/Hive. The file format is tab delimited and, while pretty simple for the most part, it does allow multiple newlines and tabs within a field, as long as they are escaped by a backslash (\\n and \\t). As a result I've opted to create my own InputFormat to handle the multiple newlines and convert those tabs to spaces, since Hive is going to split on the tabs. I've just tried loading some sample data into the table in Hive and got the following error:
CREATE TABLE (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT 'OmnitureDataFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat';
FAILED: Error in semantic analysis: line 1:14 Input Format must implement InputFormat omniture_hit_data
The odd thing is that my input format does extend org.apache.hadoop.mapreduce.lib.input.TextInputFormat (https://gist.github.com/4a380409cd1497602906).
Does Hive require that you extend org.apache.hadoop.hive.ql.io.HiveInputFormat instead? If so, do I have to rewrite any of my existing class code for the InputFormat and RecordReader, or can I effectively just change the class it's extending?
Upvotes: 4
Views: 13220
Reputation: 4487
Figured this out after looking at the code for LineReader and TextInputFormat. Created a new InputFormat to deal with this as well as an EscapedLineReader.
https://github.com/msukmanowsky/OmnitureDataFileInputFormat
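For anyone who only wants the gist of the escaped-line handling, here is a minimal sketch of the idea in plain Java with no Hadoop dependencies. It is not the code from the repository above: the class name EscapedLineSketch and the method splitRecords are hypothetical, it works on an in-memory String rather than streaming bytes from a split the way a real RecordReader would, and it assumes records end with a plain \n and that both escaped newlines and escaped tabs should become a single space.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the EscapedLineReader from the linked repository.
class EscapedLineSketch {

    /**
     * Splits raw data into logical records.
     * Assumptions: an unescaped '\n' ends a record; a backslash followed by
     * '\n' or '\t' is an escaped character inside a field and is replaced by
     * a space; any other backslash sequence is kept unchanged.
     */
    static List<String> splitRecords(String data) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < data.length()) {
            char c = data.charAt(i);
            if (c == '\\' && i + 1 < data.length()) {
                char next = data.charAt(i + 1);
                if (next == '\n' || next == '\t') {
                    // Escaped newline or tab: keep the field intact.
                    current.append(' ');
                } else {
                    // Some other escape (e.g. an escaped backslash): pass through.
                    current.append(c).append(next);
                }
                i += 2;
            } else if (c == '\n') {
                // Unescaped newline: the logical record is complete.
                records.add(current.toString());
                current.setLength(0);
                i++;
            } else {
                current.append(c);
                i++;
            }
        }
        // Emit a trailing record that has no final newline.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // One record with an escaped newline and an escaped tab, then a plain record.
        String sample = "a\tb\\\nc\td\\\te\nf\tg\n";
        for (String record : splitRecords(sample)) {
            System.out.println(record.split("\t", -1).length + " fields: " + record);
        }
    }
}
```

Consuming the character after a backslash as part of the escape means an escaped backslash directly before a newline still ends the record, which is probably what you want for this kind of format.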
Upvotes: 3