hba

Reputation: 7790

Which FileInputFormat to use to read Hadoop Archive (HAR) files

I have created a har-file using the command line utility: hadoop archive.

How do I read the content of a HAR file in MapReduce or Spark? Is there a FileInputFormat that understands HAR files?


Following the answer below, here is a simple Pig script in case someone else is interested:

A = LOAD 'har:///user/me/my.har/*.parquet'
    USING parquet.pig.ParquetLoader('key:chararray');

Upvotes: 0

Views: 2398

Answers (1)

OneCricketeer

Reputation: 191738

From Hadoop Archives and MapReduce

Using Hadoop Archives in MapReduce is as easy as specifying a different input filesystem than the default file system. If you have a Hadoop archive stored in HDFS at /user/zoo/foo.har, then to use this archive as MapReduce input, all you need to do is specify the input directory as har:///user/zoo/foo.har. Since Hadoop Archives are exposed as a file system, MapReduce will be able to use all the logical input files in Hadoop Archives as input.

So, you should be able to use whatever FileInputFormat you would use to read an HDFS directory of the same files.
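To make the path convention concrete, here is a minimal sketch in plain Java showing how a HAR stored in HDFS is addressed through the har:// URI scheme; the path /user/zoo/foo.har and the part-file name are illustrative, and the Hadoop/Spark calls in the comments assume the usual MapReduce and Spark APIs are on the classpath:

```java
import java.net.URI;

public class HarPathExample {
    public static void main(String[] args) {
        // An archive created with `hadoop archive` and stored in HDFS at
        // /user/zoo/foo.har is addressed via the "har" scheme, not "hdfs".
        // Files inside the archive appear as ordinary paths under it.
        URI harUri = URI.create("har:///user/zoo/foo.har/part-m-00000");

        System.out.println(harUri.getScheme()); // the "har" filesystem scheme
        System.out.println(harUri.getPath());   // logical file path inside HDFS

        // In a real MapReduce driver you would hand this URI to the same
        // input format you would use on a plain HDFS directory, e.g.:
        //   FileInputFormat.addInputPath(job, new Path("har:///user/zoo/foo.har"));
        // Spark resolves the scheme the same way:
        //   spark.read().textFile("har:///user/zoo/foo.har")
    }
}
```

The point is that no special FileInputFormat is needed: the har:// scheme selects a different Hadoop FileSystem implementation, and everything downstream treats the archive's contents as ordinary input files.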

Upvotes: 2
