Reputation: 7790
I have created a HAR file using the command-line utility hadoop archive.
How do I read the contents of the HAR file in MapReduce or Spark? Is there a FileInputFormat that can understand HAR files?
Following the answer below, here is a simple Pig script in case someone else is interested:
A = LOAD 'har:///user/me/my.har/*.parquet'
    USING parquet.pig.ParquetLoader('key:chararray');
Upvotes: 0
Views: 2398
Reputation: 191738
From the Hadoop documentation, Hadoop Archives and MapReduce:
Using Hadoop Archives in MapReduce is as easy as specifying a different input filesystem than the default file system. If you have a hadoop archive stored in HDFS in /user/zoo/foo.har, then for using this archive for MapReduce input, all you need to do is specify the input directory as har:///user/zoo/foo.har. Since Hadoop Archives is exposed as a file system, MapReduce will be able to use all the logical input files in Hadoop Archives as input.
So, you should be able to use whatever FileInputFormat you would use to read an HDFS directory of the same files.
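For example, here is a minimal MapReduce driver sketch in Scala under that assumption. The only HAR-specific line is the har:/// input path; the mapper/reducer default to identity and the output path is a hypothetical placeholder:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Minimal driver sketch: only the input path changes versus a plain HDFS job.
object HarJob {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "read-har")
    job.setJarByClass(HarJob.getClass)

    // Use the same FileInputFormat you would use on the unarchived files;
    // the har:/// scheme makes the archive look like a plain directory.
    job.setInputFormatClass(classOf[TextInputFormat])
    FileInputFormat.addInputPath(job, new Path("har:///user/zoo/foo.har"))

    FileOutputFormat.setOutputPath(job, new Path("/user/zoo/out")) // hypothetical

    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}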
Upvotes: 2