Reputation: 6544
If I'm running Pig on a bunch of *.tar.gz files, PigStorage handles the decompression fine, but the tar headers between the files inside each archive aren't stripped. Is there a simple way to handle this, or do I have to write my own RecordReader? If so, what would that look like?
Upvotes: 3
Views: 385
Reputation: 4575
You can use tar to clean up the headers on the fly. In your Pig script, do:
--Call to tar that reads the archive from stdin (f -) and writes the file contents to stdout (-O)
DEFINE CLEANTAR `tar xvf - -O`;
--Now, remove tar headers from your data
cleaned = STREAM mydata THROUGH CLEANTAR;
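To check what the tar command does before wiring it into Pig, here is a minimal sketch you can run in a plain shell. The file names and contents are made up for illustration, and it assumes a tar that supports `-O` (GNU tar and bsdtar both do). The `v` flag is left out because it only adds a file listing and isn't needed to get the contents.

```shell
# Build a small tar archive from two text files (hypothetical sample data).
workdir=$(mktemp -d)
printf 'a,1\nb,2\n' > "$workdir/part1.csv"
printf 'c,3\n' > "$workdir/part2.csv"
tar cf "$workdir/sample.tar" -C "$workdir" part1.csv part2.csv

# Read the archive from stdin and write only the member contents to stdout,
# which is the same thing the streaming command is asked to do.
cleaned=$(tar xf - -O < "$workdir/sample.tar")
printf '%s\n' "$cleaned"
```

The output should be the three data lines back to back, with no header bytes in between.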
EDIT: Added the following alternative.
You can also remove the tar headers using sed:
--Remove tar headers using sed (\o000 is GNU sed's octal escape for the NUL byte)
DEFINE CLEANTAR `sed 's/[^\n]*\o000//g'`;
--Now, remove tar headers from your data
cleaned = STREAM mydata THROUGH CLEANTAR;
Upvotes: 5