ronex dicapriyo
ronex dicapriyo

Reputation: 161

File handling in python for large amount of data

I am looking for some efficient way to read from a large file and grep a string 'abracadabra' (=s1), here s1 is always present before eof(end of file) And if s1 is present search for another string 'P QRST'

i.e. Data of File f1:

ABCD EFGH IJKL MNOP
QRST UVW XY Z
...
...
abracadabra(eof)

I am novice for python, to do this I was using following method:

f = open('f1', 'r')
for line in f:
   if 'abracadabra' in line:
      print 'Found'
      found = True
      break
else:
   print 'EOF reached, string not found'        
f.close()

f = open('f1', 'r')
for line in f:
   if 'P QRST' in line:             # 'P QRST' can be present anywhere in the file(can be either in begging or just before s1)
      #do some operation

But this approach involves huge overhead for a large file data, as it read the files twice. Please suggest some efficient approach to do this.

Is there any efficient way like a 'grep' application in Linux shell?

Upvotes: 0

Views: 253

Answers (3)

Roland Smith
Roland Smith

Reputation: 43495

If grep already does what you want, then just use it with the subprocess module;

rv = subprocess.check_call(['grep', first, filename])
if rv is 0:
    rv = subprocess.check_call(['grep', second, filename])
    if rv is 0:
        print 'Found', second
    elif rv is 1:
        print second, 'not found'
    else:
        print 'Error looking for', second
elif rv is 1:
    print first, 'not found'
else:
    print 'Error looking for', first

Edit: (in response to comments)

Remember that the most efficient code is the one you don't have to write, since programmer time is much more expensive than computer time. Also keep in mind that grep has been well optimized. One of the big wins was not splitting the input into lines. Which your Python solution is doing twice.

At the very least you should keep the list of lines and re-use it.

But if you want to do it in Python, there are some things that you can copy from GNU grep;

  • Use page-aligned, page-sized blocks of memory
  • Avoid splitting the data into lines
  • Use mmap

Edit 2:

As Spacedman correctly mentions, there is no substitute for testing. So let's do that.

I ran some tests on a 40 MB test file, searching for the non-existant string abracadabra. First, using BSD grep;

> time grep abracadabra procmail.log
0.242u 0.015s 0:00.25 100.0%    58+179k 0+0io 0pf+0w
> time grep abracadabra procmail.log
0.192u 0.016s 0:00.20 100.0%    57+176k 0+0io 0pf+0w
> time grep abracadabra procmail.log
0.184u 0.023s 0:00.20 100.0%    59+183k 0+0io 0pf+0w
> time grep abracadabra procmail.log
0.199u 0.007s 0:00.20 95.0% 60+186k 0+0io 0pf+0w
> time grep abracadabra procmail.log
0.184u 0.023s 0:00.20 100.0%    59+183k 0+0io 0pf+0w
> time grep abracadabra procmail.log
0.184u 0.024s 0:00.20 100.0%    57+176k 0+0io 0pf+0w

Then with the following program:

import mmap

with open('procmail.log', 'r+b') as p:
    mm = mmap.mmap(p.fileno(), 0)
    rv = mm.find('abracadabra')
    print rv

This gave:

> time python foo.py
-1
0.139u 0.024s 0:00.16 93.7% 1701+549k 0+1io 0pf+0w
> time python foo.py
-1
0.094u 0.039s 0:00.13 92.3% 1807+583k 0+1io 0pf+0w
> time python foo.py
-1
0.109u 0.023s 0:00.13 92.3% 1807+583k 0+1io 0pf+0w
> time python foo.py
-1
0.117u 0.015s 0:00.13 92.3% 1807+583k 0+1io 0pf+0w
> time python foo.py
-1
0.125u 0.007s 0:00.13 92.3% 1807+583k 0+1io 0pf+0w

So;

  • on my machine
  • searching for a simple string

using mmap in Python is slightly faster than calling BSD grep.

Keep in mind that the mmap objects (at least in python 2.7) do not support searching for regular expressions. And results may differ with file size, size of the RAM, operating system et cetera.

Upvotes: 2

logc
logc

Reputation: 3923

You can use the seek method of the file handle to rewind from the end of the file, by using a 2 as its second argument. Then you can check for the abracadabra string at the end of the file, and if found, scan the whole file (from the beginning) for the P QRST string:

with open('sample.txt', 'r') as fh:
    s1 = 'abracadabra'
    fh.seek(-len(s1), 2)
    if fh.read() == s1:
        fh.seek(0)
        for line in fh:
            if 'P QRST' in line:
                print 'Bingo'

Upvotes: 0

Spacedman
Spacedman

Reputation: 94172

If you find the abracadabra, set a flag, if you find the PQRST, set a flag, if you've found both, do something.

found = False
foundPQRST = False
f = open('f1', 'r')
for line in f:
   if 'P QRST' in line:
      foundPQRST= True
      PQRSTline = line   
   if 'abracadabra' in line:
      print 'Found'
      abraline = line
      found = True
   if found and foundPQRST:
      dosomething(PQRSTline, abraline)
      break
if not found:
   print "wasn't found"
f.close()

If there are multiple PQRST-matching lines, this will call dosomething with the last one before 'abracadabra' or the first one after. Not how it saves the PQRST line and the abracadabra line when it finds matches to use it in the call. You might not need to do this, depending on what dosomething does.

Also, you might want to use a with clause with the open function.

Upvotes: 1

Related Questions