Reputation: 10431
I have a large.tar.gz file containing about 1 million files, about a quarter of which are HTML files, and I want to parse a few lines of each HTML file.
I want to avoid extracting the contents of large.tar.gz into a folder and then parsing the HTML files. Instead, how can I pipe the contents of the HTML files in large.tar.gz straight to STDOUT so that I can grep/parse out the information I want from them?
I presume there must be some magic like:
tar -special_flags large.tar.gz | grep_only_files_with_extension html | xargs -n1 head -n 99999 | ./parse_contents.pl -
Any ideas?
Upvotes: 21
Views: 23460
Reputation: 88563
Use this with GNU tar to extract a tgz to stdout:
tar -xOzf large.tar.gz --wildcards '*.html' | grep ...
-O, --to-stdout: extract files to standard output
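Note that -O writes all matching members to stdout back to back, so file boundaries are lost. Since the question asks for a few lines of each file, here is a rough sketch of both variants, assuming GNU tar. The function names html_from_tar and first_lines_per_html are made up for illustration, and the second one relies on GNU tar's --to-command option, which runs the given command once per extracted member with that member's contents on stdin.

```shell
# Stream every *.html member of a gzipped tarball to stdout as one
# concatenated stream, without unpacking anything to disk.
# Assumes GNU tar; html_from_tar is a hypothetical helper name.
html_from_tar() {
    # $1 = path to the .tar.gz archive
    tar -xOzf "$1" --wildcards '*.html'
}

# Print only the first N lines of each *.html member.
# Assumes GNU tar's --to-command; first_lines_per_html is a hypothetical name.
first_lines_per_html() {
    # $1 = path to the .tar.gz archive, $2 = number of lines to keep per member
    tar -xzf "$1" --wildcards '*.html' --to-command="sed -n '1,${2}p'"
}
```

For example, `html_from_tar large.tar.gz | grep ...` matches the one-liner above, while `first_lines_per_html large.tar.gz 20 | ./parse_contents.pl -` would feed only the first 20 lines of each HTML file to the parser. With --to-command, GNU tar also exports the member name to the command as the TAR_FILENAME environment variable, which may help if the parser needs to know which file a line came from.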
Upvotes: 49