Pavunkumar

Reputation: 5345

Performing grep operation in tar files without extracting

I have a list of files that contain particular patterns, but those files have been archived with tar. I want to search for a pattern inside the tar file and find out which files contain it, without extracting the files.

Any ideas?

Upvotes: 50

Views: 82964

Answers (7)

mxmlnkn

Reputation: 2141

You can mount the TAR archive with ratarmount and then simply search for the pattern in the mounted view:

pip install --user ratarmount
ratarmount large-archive.tar mountpoint
grep -r '<pattern>' mountpoint/

This should be much faster than iterating over each file and printing it to stdout, especially for compressed TARs.


Here is a simple comparison benchmark:

function checkFilesWithRatarmount()
{
    local pattern=$1
    local archive=$2
    ratarmount "$archive" "$archive.mountpoint"
    'grep' -r -l "$pattern" "$archive.mountpoint/"
}

function checkEachFileViaStdOut()
{
    local pattern=$1
    local archive=$2
    tar --list --file "$archive" | while read -r file; do
        if tar -x --file "$archive" -O -- "$file" | grep -q "$pattern"; then
            echo "Found pattern in: $file"
        fi
    done
}

function createSampleTar()
{
    for i in $( seq 40 ); do
        head -c $(( 1024 * 1024 )) /dev/urandom | base64 > "$i.dat"
    done
    tar -czf "$1" [0-9]*.dat
}

createSampleTar myarchive.tar.gz
time checkEachFileViaStdOut ABCD myarchive.tar.gz
time checkFilesWithRatarmount ABCD myarchive.tar.gz
sleep 0.5s
fusermount -u myarchive.tar.gz.mountpoint

Results in seconds for a 55 MiB uncompressed and 42 MiB compressed TAR archive containing 40 files:

Compression   Ratarmount      Bash loop over tar -O
none          0.31 +- 0.01    0.55 +- 0.02
gzip          1.1 +- 0.1      13.5 +- 0.1
bzip2         1.2 +- 0.1      97.8 +- 0.2

Of course, these results depend heavily on the archive size and on how many files the archive contains. The test archives are small because I didn't want to wait too long, but they already show the problem. The more files there are, the longer it takes tar -O to reach the requested file. For compressed archives, the total time grows quadratically with the archive size, because each file is requested separately and everything before the requested file has to be decompressed each time. Ratarmount solves both problems.

Upvotes: 0

Gavin S. Yancey

Reputation: 1276

This can be done with tar --to-command and grep --label:

tar xaf archive.tar.gz --to-command 'grep -EHn --label="$TAR_FILENAME" your_pattern_here || true'
  • --label gives grep the filename
  • -H tells grep to display the filename, and -n the line number
  • || true is needed because grep exits with a non-zero status when the pattern is not found, and tar would complain about that.
  • xaf means extract, and automatically pick the decompressor based on the file extension
  • --to-command has tar pass each file in the tarfile to a separate invocation of grep, and sets various environment variables with info about the file. See the manpage for more info.

This is heavily based on Chipaca's answer (and Daniel H's comment), but it should be a bit easier to use and needs only tar and grep.

Upvotes: 7

Robert Muil

Reputation: 3085

The command zgrep should do exactly what you want, directly.

For example:

zgrep "mypattern" *.gz

http://linux.about.com/library/cmd/blcmdl1_zgrep.htm

Upvotes: 44

Chipaca

Reputation: 387

GNU tar has --to-command. With it you can have tar pipe each file from the archive into the given command. For the case where you just want the lines that match, that command can be a simple grep. To know the filenames you need to take advantage of tar setting certain variables in the command's environment; for example,

tar xaf thing.tar.xz --to-command="awk -e '/thing.to.match/ {print ENVIRON[\"TAR_FILENAME\"] \":\", \$0}'"

Because I find myself using this often, I keep the following script around:

#!/bin/sh
set -eu

if [ $# -lt 2 ]; then
    echo "Usage: $(basename "$0") <pattern> <tarfile>"
    exit 1
fi

if [ -t 1 ]; then
    h="$(tput setf 4)"
    m="$(tput setf 5)"
    f="$(tput sgr0)"
else
    h=""
    m=""
    f=""
fi

tar xaf "$2" --to-command="awk -e '/$1/{gsub(\"$1\", \"$m&$f\"); print \"$h\" ENVIRON[\"TAR_FILENAME\"] \"$f:\", \$0}'"

Upvotes: 10

ghostdog74

Reputation: 342649

The tar command has a -O switch that extracts files to standard output, so you can pipe that output to grep or awk:

tar xvf test.tar -O | awk '/pattern/{print}'

tar xvf test.tar -O | grep "pattern"

For example, to print the name of each file in which the pattern is found:

tar tf test.tar | while IFS= read -r FILE
do
    if tar xf test.tar -O -- "$FILE" | grep "pattern"; then
        echo "found pattern in: $FILE"
    fi
done

Upvotes: 43

Matthew Flaschen

Reputation: 284927

The easiest way is probably to use avfs. I've used this before for such tasks.

Basically, the syntax is:

avfsd ~/.avfs # Sets up an AVFS virtual filesystem
rgrep pattern ~/.avfs/path/to/file.tar#/

/path/to/file.tar is the path to the actual tar file.

Prepending ~/.avfs/ (the mount point) and appending # makes avfs expose the tar file as a directory.

Upvotes: 2

Ignacio Vazquez-Abrams

Reputation: 799110

Python's tarfile module, together with TarFile.extractfile(), lets you inspect the tarball's contents without extracting it to disk.
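
A minimal sketch of how that could look, assuming you want grep-like output with the member name and line number. The helper names grep_tar and build_demo_archive are made up for this illustration and are not part of the tarfile module; the demo archive is only there so the snippet runs on its own.

```python
import io
import os
import re
import tarfile
import tempfile


def grep_tar(archive_path, pattern):
    """Yield (member_name, line_number, line) for every matching line."""
    regex = re.compile(pattern.encode())
    # "r:*" lets tarfile detect gzip, bzip2, or xz compression on its own.
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, devices
            stream = archive.extractfile(member)
            if stream is None:
                continue
            with stream:
                # The member is read from the archive, never written to disk.
                for line_number, line in enumerate(stream, start=1):
                    if regex.search(line):
                        text = line.rstrip(b"\n").decode(errors="replace")
                        yield member.name, line_number, text


def build_demo_archive(path):
    """Write a tiny gzip-compressed tar so the example is self-contained."""
    files = (("a.txt", b"alpha\nneedle here\n"), ("b.txt", b"nothing\n"))
    with tarfile.open(path, "w:gz") as archive:
        for name, payload in files:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.tar.gz")
        build_demo_archive(demo)
        for name, number, text in grep_tar(demo, "needle"):
            print(f"{name}:{number}:{text}")
```

Because the archive is walked once from start to end, a compressed tarball is decompressed only a single time, unlike a loop that calls tar once per member.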

Upvotes: 2
