Skorpius

Reputation: 2255

Code works on...smaller tar files but not bigger ones?

So I'm working on a script that searches through tar files for specific strings, basically zgrep. For some reason, though, it freezes up on much larger files...

Any ideas?

#!/bin/bash

tarname=$1
pattern=$2
max=$3

count=1
tar -tf  $tarname | while read -r FILE
do
    tar -xf  $tarname $FILE

    count=$(expr $count + 1)

    if [ "$count" == "$max" ]; then
        rm $FILE
        break
    fi

    if grep $pattern $FILE; then
        echo "found pattern in :" $FILE
        mv $FILE stringfind
    else
        rm $FILE
    fi

done
if [ $(ls stringfind | wc -l) -eq 0 ]; then
    echo "File Not Found"
fi

I need it done this way to reduce space usage, but why exactly is it not going on to the other files? I did a loop print-out test and it only looped once or twice before stopping...

So it's reading through the entire tar file every time I call "read"? As in, if a tar has 100 files, it's reading 100 x 100 = 10,000 times?

Upvotes: 1

Views: 82

Answers (3)

tripleee

Reputation: 189678

You keep opening and closing the tar file, reading it from the beginning each time. It would be much more economical to just extract all the files in one go, if you can.

If you can't, moving to a language with library support for tar files would be my suggestion. Judging by https://docs.python.org/2/library/tarfile.html, what you need should be doable in just a few lines of Python.
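For the first suggestion, a minimal shell sketch of "extract everything in one go, then search" might look like this. The demo archive, the file names, and the extracted/ directory are made up for illustration; stringfind is the directory name from the question:

```shell
#!/bin/bash
set -e
# Demo setup (made-up names): build a small archive to search.
cd "$(mktemp -d)"
mkdir src
echo "hello needle"  > src/a.txt
echo "nothing here"  > src/b.txt
tar -cf demo.tar -C src a.txt b.txt

pattern=needle
mkdir -p stringfind extracted

# Read the archive exactly once, instead of once per member...
tar -xf demo.tar -C extracted

# ...then move every matching file aside (-r recurse, -l names only).
grep -rl -- "$pattern" extracted | while read -r f; do
    mv "$f" stringfind/
done
rm -rf extracted

ls stringfind    # prints: a.txt
```

This reads the archive once and touches the disk once per member, rather than rescanning the whole archive for every file.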

Upvotes: 2

David W.

Reputation: 107080

You are reading the list of file names, then running tar -xf on the archive once per file. This is fairly inefficient. Just extract the whole tarball, then use grep -l -R (which works on most systems) to search for the files that contain the strings. The -l means list the file name only, rather than the matching line inside the file.
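Assuming the tarball has already been extracted into a directory (extracted/ and the file names below are made up), the search step would be just:

```shell
#!/bin/bash
# Minimal sketch of grep -l -R over an extracted tree (made-up names).
cd "$(mktemp -d)"
mkdir extracted
echo "the needle is here" > extracted/hit.txt
echo "nothing to see"     > extracted/miss.txt

# -R recurses into the directory; -l prints only names of matching files.
grep -l -R -- "needle" extracted    # prints: extracted/hit.txt
```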

Why does it work on small ones and not large ones? It could be this logic:

if [ "$count" == "$max" ]; then
    rm $FILE
    break
fi

You're counting the number of times you're in the loop, and break when you hit max. If max is 100, this will fail on tarballs that contain 1000 files when the string is in the 200th file.

Upvotes: 1

Jan

Reputation: 96

  1. You extract each file by itself.
    • Without the "-n" parameter tar believes that the file is not seekable
    • This causes tar to read the whole archive from the beginning, even if you want to handle only the last file
  2. You should first increment count and check for the break condition before extracting the last (obviously unneeded) file
  3. Since you don't seem to evaluate the content of the found files beyond testing whether "stringfind" is empty, you could just break after finding the first such file
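Put together, points 1 to 3 could look roughly like this. It is a sketch of the question's loop, made self-contained with a made-up demo archive; note that -n (--seek) is a GNU tar option and only helps when the archive is on a seekable medium:

```shell
#!/bin/bash
set -e
# Demo archive (made-up names) so the sketch is self-contained.
cd "$(mktemp -d)"
echo "no match here" > one.txt
echo "a needle"      > two.txt
echo "a needle"      > three.txt
tar -cf demo.tar one.txt two.txt three.txt
rm one.txt two.txt three.txt

tarname=demo.tar
pattern=needle
max=10
mkdir -p stringfind
count=0

tar -tf "$tarname" | while read -r FILE; do
    count=$((count + 1))
    # 2. check the limit BEFORE extracting another (unneeded) file
    if [ "$count" -gt "$max" ]; then
        break
    fi
    # 1. -n (--seek) tells GNU tar the archive is seekable, so it
    #    does not have to reread it from the start for each member
    tar -n -xf "$tarname" "$FILE"
    if grep -q -- "$pattern" "$FILE"; then
        mv "$FILE" stringfind/
        break    # 3. one hit is enough for the final existence test
    fi
    rm -f "$FILE"
done

if [ "$(ls stringfind | wc -l)" -eq 0 ]; then
    echo "File Not Found"
fi
```

With the demo archive above, the loop extracts one.txt, discards it, finds the pattern in two.txt, and stops without ever extracting three.txt.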

Upvotes: 1
