Newtang

Reputation: 6544

Handling tar headers in Pig

I'm running Pig on a bunch of *.tar.gz files. PigStorage handles the gzip decompression fine, but the tar header blocks that sit between the archived files are not stripped. Is there a simple way to handle this, or do I have to write my own RecordReader? If so, what would that look like?

Upvotes: 3

Views: 385

Answers (1)

cabad

Reputation: 4575

You can use tar itself to strip the headers on the fly by streaming the data through it. In your Pig script, do:

--Call tar so that it reads the archive from stdin (f -) and writes the extracted file contents to stdout (-O)
DEFINE CLEANTAR `tar xvf - -O`;

--Now, remove tar headers from your data
cleaned = STREAM mydata THROUGH CLEANTAR;

EDIT: Added the following alternative.

You can also remove the tar headers using sed (the \o000 octal escape for the NUL byte is a GNU sed extension):

--Remove tar headers using sed
DEFINE CLEANTAR `sed 's/[^\n]*\o000//g'`;

--Now, remove tar headers from your data
cleaned = STREAM mydata THROUGH CLEANTAR;
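
If GNU sed is not available on the worker nodes, the same filtering can be written as a small script. The following is only a minimal sketch, not part of the original answer: the function names and the in-memory sample archive are made up for illustration. It assumes the header bytes contain no newline and that real data lines contain no NUL bytes. On each line it drops everything up to and including the last NUL byte, which is what the sed expression above does.

```python
import io
import tarfile


def strip_tar_headers(raw: bytes) -> bytes:
    # Same effect as the sed expression: per line, remove everything up to
    # and including the last NUL byte and keep only the text after it.
    cleaned = []
    for line in raw.split(b"\n"):
        # rpartition returns the whole line as the tail when no NUL is present
        cleaned.append(line.rpartition(b"\x00")[2])
    return b"\n".join(cleaned)


def build_sample_tar() -> bytes:
    # Build a tiny uncompressed tar archive in memory (made-up sample data).
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as archive:
        for name, body in [("one.csv", b"a,1\nb,2\n"), ("two.csv", b"c,3\n")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(body)
            archive.addfile(info, io.BytesIO(body))
    return buf.getvalue()


sample = build_sample_tar()
cleaned = strip_tar_headers(sample)
print(cleaned.decode("ascii"))
```

To use this with STREAM, the filter would read from sys.stdin.buffer and write to sys.stdout.buffer instead of building a sample archive, and the script would have to be made available on the task nodes (for example with the SHIP clause of DEFINE). Note that the first data line of each archived file shares a "line" with the preceding header, which is why the text after the last NUL byte is kept rather than the whole line being dropped.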

Upvotes: 5
