Sathya Magesh Kumar

Reputation: 95

How to load multiple files in tar.gz into Pig

Scenario: The vendor will provide a raw feed in tar.gz format that contains multiple tab-delimited files.

File details:
a) One hit-level data file
b) Multiple lookup files
c) One header file for (a)

The feed (tar.gz) will be ingested and landed in the BDP operational raw area.

Query: We would like to load this data from the operational raw area into Pig for data quality checks. How can this be achieved? Do the files need to be extracted in Hadoop before we can use them, or are there alternatives? Please advise. Thanks!

Note: A sample script would be very helpful.

Upvotes: 0

Views: 2172

Answers (1)

Murali Rao

Reputation: 2287

Ref: http://pig.apache.org/docs/r0.9.1/func.html#load-store-functions

Extract from the docs:

Handling Compression

Support for compression is determined by the load/store function. PigStorage and TextLoader support gzip and bzip compression for both read (load) and write (store). BinStorage does not support compression.

To work with gzip compressed files, input/output files need to have a .gz extension. Gzipped files cannot be split across multiple maps; this means that the number of maps created is equal to the number of part files in the input location.

A = load 'myinput.gz'; 
store A into 'myoutput.gz'; 
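Note that the extract above covers plain .gz files. Gzip decompression undoes only the gzip layer, so a tar.gz loaded directly would expose the tar container rather than the individual files inside it. One possible workaround, sketched below in Python under stated assumptions, is to repack each member of the archive as its own .gz file before copying the results to HDFS. The function name `repack_tar_members` and all file names are hypothetical and chosen only for illustration; this is not an approach described in the Pig docs.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def repack_tar_members(tar_path, out_dir):
    """Write each regular file in a .tar.gz archive as its own .gz file.

    Hypothetical helper for illustration. Returns the paths written,
    in archive order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with tarfile.open(tar_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            # Keep only the base name so a member path cannot escape out_dir.
            # Assumption: base names are unique within the archive; members
            # sharing a base name would overwrite each other.
            target = out_dir / (Path(member.name).name + ".gz")
            source = archive.extractfile(member)
            with source, gzip.open(target, "wb") as dest:
                shutil.copyfileobj(source, dest)
            written.append(target)
    return written
```

Once the resulting .gz files are copied into HDFS, each one could be loaded as in the docs example, for instance with PigStorage('\t') for the tab-delimited data. Because gzipped files cannot be split, each file would be processed by a single map.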

Upvotes: 1
