Reputation: 95
Scenario: The vendor will provide a raw feed in tar.gz format containing multiple tab-delimited files. File detail: a) one hit-level data file, b) multiple lookup files, c) one header file for (a).
The feed (tar.gz) will be ingested and landed in the BDP operational raw area.
Query: I would like to load this data from the operational raw area into Pig for data quality checks. How can this be achieved? Do the files need to be extracted in Hadoop before we can use them, or are there alternatives? Please advise. Thanks! Note: A sample script would be very helpful.
Upvotes: 0
Views: 2172
Reputation: 2287
Ref: http://pig.apache.org/docs/r0.9.1/func.html#load-store-functions
Extract from the docs:
Handling Compression
Support for compression is determined by the load/store function. PigStorage and TextLoader support gzip and bzip compression for both read (load) and write (store). BinStorage does not support compression.
To work with gzip compressed files, input/output files need to have a .gz extension. Gzipped files cannot be split across multiple maps; this means that the number of maps created is equal to the number of part files in the input location.
A = load 'myinput.gz';
store A into 'myoutput.gz';
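One caveat for the tar.gz case: the extract above covers gzip only. A .tar.gz is a tar archive wrapped in gzip, and as far as I know PigStorage only undoes the gzip layer, so the archive most likely has to be unpacked before loading (for example with tar on an edge node, then copying the member files into HDFS). Below is a rough sketch of what the load and a couple of basic quality checks could look like after that step. All paths, file names, and lookup column names are hypothetical; adjust them to the actual feed layout.

```pig
-- Sketch only: all paths and field names below are hypothetical.
-- Assumes the tar.gz was unpacked beforehand and each member file was
-- placed in HDFS, either as plain text or as an individual .gz file.

-- Hit-level data: tab-delimited, loaded without a schema, so fields are
-- addressed by position ($0, $1, ...).
hits = LOAD '/data/raw/vendor_feed/hit_data.tsv.gz' USING PigStorage('\t');

-- One of the lookup files, assumed here to have two columns.
browsers = LOAD '/data/raw/vendor_feed/browser.tsv.gz' USING PigStorage('\t')
           AS (browser_id:chararray, browser_name:chararray);

-- Data quality check 1: total number of hit rows.
hits_all  = GROUP hits ALL;
hit_count = FOREACH hits_all GENERATE COUNT_STAR(hits) AS total_rows;

-- Data quality check 2: rows whose first field is missing
-- (PigStorage loads an empty field as null).
missing_key = FILTER hits BY $0 IS NULL;

STORE hit_count INTO '/data/dq/vendor_feed/hit_count';
STORE missing_key INTO '/data/dq/vendor_feed/missing_key';
```

The header file from (c) could be used to write an explicit AS (...) schema for hits instead of positional references. Keep in mind that each .gz file is read by a single map, so very large hit files may load faster uncompressed or split into several parts.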
Upvotes: 1