tsukimi

Reputation: 1645

Search large tar.gz file for keywords, copy and delete

What is the best way, with large log tar.gz files (some are 20 GB), to open them and search for a keyword, copy the found files to a directory, and then delete the file so it doesn't consume disk space? I have some code below; it was working, but then it suddenly stopped extracting files for some reason. If I remove the -O option from tar it extracts files again.

mkdir -p found
tar tf "$1" | while read -r FILE
do
    if tar xf "$1" "$FILE" -O | grep -l "$2"; then
        echo "found pattern in : $FILE"
        cp "$FILE" "found/$(basename "$FILE")"
        rm -f "$FILE"
    fi
done

$1 is the tar.gz file, $2 is the keyword.

UPDATE

I'm doing the below, which works, but a small file I have contains over 2 million compressed files, so it will take hours to look at all of them. Is there a Python solution or similar that can do it faster?

#!/bin/sh
# tarmatch.sh
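# Run by GNU tar via --to-command: tar invokes this script once per archive
# member, with that member's contents on stdin and its name in $TAR_FILENAME.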
if grep -l "$1" ; then 
  echo  "Found keyword in ${TAR_FILENAME}";
  tar -zxvf "$2" "${TAR_FILENAME}" 
else
  echo "Not found in ${TAR_FILENAME}";
fi
true

tar -zxf 20130619.tar.gz --to-command "./tarmatch.sh '@gmail' 20130619.tar.gz"

UPDATE 2

I'm using Python now and the speed seems to have increased: it's doing about 4000 records a second, while the bash version was doing about 5. I'm not that strong in Python, so this code could probably be optimized; please let me know if it can be.

import tarfile
import time
import os
import ntpath, sys

if len(sys.argv) < 3:
    print "Please provide the tar.gz file and keyword to search on"
    print "USAGE: tarfind.py example.tar.gz keyword"
    sys.exit(1)

t = tarfile.open(sys.argv[1], 'r:gz')
cnt = 0
foundCnt = 0
now = time.time()
directory = 'found/'
if not os.path.exists(directory):
    os.makedirs(directory)

for tar_info in t:
    cnt += 1
    if not tar_info.isfile():
        continue
    if cnt % 1000 == 0:
        print "Processed " + str(cnt) + " files"
    f = t.extractfile(tar_info)
    if f is None:
        continue
    data = f.read()  # read each member once and reuse it below
    if sys.argv[2] in data:
        foundCnt += 1
        newFile = open(directory + ntpath.basename(tar_info.name), 'wb')
        newFile.write(data)  # no need to seek back and re-read the member
        newFile.close()
        print "found in file " + tar_info.name

future = time.time()
timeTaken = future - now

print "Found " + str(foundCnt) + " records"
print "Time taken " + str(int(timeTaken / 60)) + " mins " + str(int(timeTaken % 60)) + " seconds"
print str(int(cnt / timeTaken)) + " records per second"
t.close()

Upvotes: 1

Views: 2648

Answers (2)

SSaikia_JtheRocker

Reputation: 5063

Since you are trying to search the files for a keyword and extract only those that match, and since your archives are huge, a sequential scan will take time, especially if the keyword only turns up somewhere in the middle.

The best advice I can give is probably to use a powerful combination of an inverted-index lookup tool such as Solr (based on the Lucene index) and Apache Tika, a content analysis toolkit.

Using these tools you can index the tar.gz files, and when you search for a keyword, the relevant documents containing it will be returned.
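To illustrate the inverted-index idea only (this is a toy Python 2 sketch, not Solr or Tika; the script name, build_index helper, whitespace tokenizer and in-memory dictionary are all simplifications I made up for the example): scan the archive once, record which members contain which tokens, and afterwards every keyword lookup is a dictionary hit instead of another full decompress-and-scan.

import re
import sys
import tarfile
from collections import defaultdict

def build_index(tgz_path):
    # token -> set of member names that contain it
    index = defaultdict(set)
    t = tarfile.open(tgz_path, 'r:gz')
    for member in t:
        if not member.isfile():
            continue
        f = t.extractfile(member)
        if f is None:
            continue
        for word in re.findall(r'\S+', f.read()):
            index[word].add(member.name)
    t.close()
    return index

if __name__ == '__main__':
    # e.g. tarindex.py 20130619.tar.gz keyword
    idx = build_index(sys.argv[1])
    for name in sorted(idx.get(sys.argv[2], [])):
        print(name)

Note the limitations of the toy version: it only matches whole tokens (so '@gmail' would not match inside 'user@gmail.com'), and the index lives in memory, which won't scale to millions of members. A real engine like Solr handles tokenization, substring/wildcard queries and persistent storage of the index, which is why the combination above is suggested.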

Upvotes: 1

hek2mgl

Reputation: 158100

If the file is really 20 GB, it will take a very long time to grep in any case. The only advice I can give is to use zgrep. This will save you from having to explicitly uncompress the archive.

zgrep PATTERN your.tgz
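With the archive and keyword from the question, that would be, for example:

zgrep '@gmail' 20130619.tar.gz

Note that zgrep runs grep over the decompressed tar stream as a whole, so it tells you whether the pattern occurs somewhere in the archive, but not by itself which member file contains it.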

Upvotes: 1
