Reputation: 1645
What is the best way, with large log tar.gz files (some are 20 GB), to open and search them for a keyword, copy the matching files to a directory, and then delete each extracted file so it doesn't consume disk space? I have some code below; it was working, but then it suddenly stopped extracting files for some reason. If I remove the -O option from tar it extracts files again.
mkdir -p found
tar tf "$1" | while read -r FILE
do
  if tar xf "$1" "$FILE" -O | grep -l "$2"; then
    echo "found pattern in : $FILE"
    cp "$FILE" "found/$(basename "$FILE")"
    rm -f "$FILE"
  fi
done
$1 is the tar.gz file, $2 is the keyword.
UPDATE
I'm doing the below, which works, but one small file I have contains 2 million plus compressed files, so it will take hours to look at all of them. Is there a Python solution or similar that can do it faster?
#!/bin/sh
# tarmatch.sh - run by tar's --to-command for each member;
# the member's contents arrive on stdin and its name in $TAR_FILENAME
if grep -l "$1"; then
  echo "Found keyword in ${TAR_FILENAME}"
  tar -zxvf "$2" "${TAR_FILENAME}"
else
  echo "Not found in ${TAR_FILENAME}"
fi
true  # always exit 0 so tar doesn't treat non-matching members as errors
tar -zxf 20130619.tar.gz --to-command "./tarmatch.sh '@gmail' 20130619.tar.gz"
UPDATE 2
I'm using Python now and the speed seems to have increased: it's doing about 4000 records a second, while the bash version was doing about 5. I'm not that strong in Python, so this code could probably be optimized; please let me know if it can be.
import tarfile
import time
import os
import ntpath, sys

if len(sys.argv) < 3:
    print "Please provide the tar.gz file and keyword to search on"
    print "USAGE: tarfind.py example.tar.gz keyword"
    sys.exit()

t = tarfile.open(sys.argv[1], 'r:gz')
cnt = 0
foundCnt = 0
now = time.time()
directory = 'found/'
if not os.path.exists(directory):
    os.makedirs(directory)

for tar_info in t:
    cnt += 1
    if tar_info.isdir():
        continue
    if cnt % 1000 == 0:
        print "Processed " + str(cnt) + " files"
    f = t.extractfile(tar_info)
    if f is None:  # skip links/devices that have no data
        continue
    if sys.argv[2] in f.read():
        foundCnt += 1
        # rewind the member and copy it out under its base name
        newFile = open(directory + ntpath.basename(tar_info.name), 'wb')
        f.seek(0, 0)
        newFile.write(f.read())
        newFile.close()
        print "found in file " + tar_info.name

future = time.time()
timeTaken = future - now
print "Found " + str(foundCnt) + " records"
print "Time taken " + str(int(timeTaken / 60)) + " mins " + str(int(timeTaken % 60)) + " seconds"
print str(int(cnt / timeTaken)) + " records per second"
t.close()
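One likely optimisation (a sketch only, sticking with the same Python 2 / tarfile setup as above): scan each member in fixed-size chunks with a small overlap instead of calling f.read() twice, so huge members never have to fit in memory and the scan can stop at the first hit. The gzip decompression itself is still single-threaded, so this mostly helps memory use and early exits rather than raw throughput:
import os
import shutil
import sys
import tarfile
import ntpath

CHUNK = 1024 * 1024  # scan members 1 MiB at a time

def member_contains(fileobj, keyword):
    # Read in chunks, keeping an overlap of len(keyword)-1 bytes so a
    # keyword that straddles a chunk boundary is still found.
    overlap = ''
    while True:
        chunk = fileobj.read(CHUNK)
        if not chunk:
            return False
        if keyword in overlap + chunk:
            return True
        overlap = chunk[-(len(keyword) - 1):] if len(keyword) > 1 else ''

def search_archive(archive, keyword, outdir='found/'):
    if not os.path.exists(outdir):
        os.makedirs(outdir)
    t = tarfile.open(archive, 'r:gz')
    for tar_info in t:
        if not tar_info.isfile():
            continue
        f = t.extractfile(tar_info)
        if f is None:
            continue
        if member_contains(f, keyword):
            print "found in file " + tar_info.name
            # rewind the member and stream it out without loading it whole
            f.seek(0, 0)
            dest = open(os.path.join(outdir, ntpath.basename(tar_info.name)), 'wb')
            shutil.copyfileobj(f, dest)
            dest.close()
    t.close()

if __name__ == '__main__':
    search_archive(sys.argv[1], sys.argv[2])
Rewinding a matched member with seek(0, 0) still forces the compressed stream to be re-read, just like the version above, so the copy step is unchanged; the gain is in the scanning pass.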
Upvotes: 1
Views: 2648
Reputation: 5063
If you are trying to search for a keyword in the files and extract only those, then because your archives are huge it can take a long time if the keyword sits somewhere in the middle.
The best advice I can give is probably to use a powerful combination of an inverted-index lookup tool such as Solr (based on the Lucene index) and Apache Tika, a content analysis toolkit.
Using these tools you can index the tar.gz files, and when you search for a keyword, the relevant documents containing it will be returned.
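As a rough sketch of that idea (it assumes a Solr server is already running with a core named logs whose schema has a text field called content, and that the third-party pysolr client is installed; Tika is skipped here because the members are plain-text logs), indexing the archive once and then querying the index could look like this:
import tarfile
import pysolr  # third-party Solr client: pip install pysolr

# Assumed: a Solr core named "logs" with a text field "content".
solr = pysolr.Solr('http://localhost:8983/solr/logs', timeout=30)

t = tarfile.open('20130619.tar.gz', 'r:gz')
batch = []
for tar_info in t:
    if not tar_info.isfile():
        continue
    f = t.extractfile(tar_info)
    if f is None:
        continue
    batch.append({
        'id': tar_info.name,  # use the member path as the document id
        'content': f.read().decode('utf-8', 'replace'),
    })
    if len(batch) >= 1000:  # send documents to Solr in batches
        solr.add(batch)
        batch = []
if batch:
    solr.add(batch)
solr.commit()
t.close()

# Later, keyword lookups hit the index instead of re-reading the archive:
for doc in solr.search('content:"@gmail"', rows=100):
    print doc['id']
The one-time indexing pass still has to decompress everything, but subsequent keyword searches come back from the index instead of re-reading 20 GB archives.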
Upvotes: 1
Reputation: 158100
If the file is really 20GB it will take very long to grep it in any case. The only advice I can give is to use zgrep. This will save you from having to explicitly uncompress the archive.
zgrep PATTERN your.tgz
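If you want the same streaming behaviour from Python, here is a sketch; like zgrep it scans the decompressed tar data as one flat stream, so it tells you the pattern is present but not which member it came from, and nothing is written to disk:
import gzip

pattern = '@gmail'
archive = gzip.open('20130619.tar.gz', 'rb')
# iterate over the decompressed tar stream line by line, zgrep-style
for lineno, line in enumerate(archive, 1):
    if pattern in line:
        print "match on decompressed line " + str(lineno)
        break
archive.close()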
Upvotes: 1