Jon Martin
Jon Martin

Reputation: 3392

Python efficient way to check if very large string contains a substring

Python is not my best language, and so I'm not all that good at finding the most efficient solutions to some of my problems. I have a very large string (coming from a 30 MB file) and I need to check if that file contains a smaller substring (this string is only a few dozen characters). The way I am currently doing it is:

if small_string in large_string:
    # logic here

But this seems to be very inefficient because it will check every possible sequence of characters in the file. I know that there will only be an exact match on a newline, so would it be better to read the file in as a list and iterate through that list to match?

EDIT: To clear up some confusion on "matching on a newline only", here's an example:

small_string = "This is a line"
big_string = "This is a line\nThis is another line\nThis is yet another"

If I'm not mistake, the in keyword will check all sequences, not just every line.

Upvotes: 12

Views: 30836

Answers (8)

Mahmoud Youssef
Mahmoud Youssef

Reputation: 23

small_string = "This is a line"
big_string = "This is a line This is another line\nThis is yet another"

test= big_string.split("This is a line" ,1)

if len(test)==2:

    print "it`s there"

elif len(test)!=2:

    print "it`s not"

Upvotes: 0

Matt Warren
Matt Warren

Reputation: 686

Are you implying that only a complete line will match? (your EDIT: matching on a newline only example seems to)

Then I imagine

for line in open('file').readlines():
  if line==small_string:
    return True
return False

IE, using == is quicker than 'in' - perhaps. I wouldn't be surprised if the underlying implementation of in catches the case where the line to search and the string to search for are the same length and just attempts an == itself.

woudl be better.

Upvotes: 1

Michał Bentkowski
Michał Bentkowski

Reputation: 2145

Is it really slow? You're talking about 30MB string; let's try it with even bigger string:

In [12]: string="agu82934u"*50*1024*1024+"string to be found"

In [13]: len(string)
Out[13]: 471859218

In [14]: %timeit "string to be found" in string
1 loops, best of 3: 335 ms per loop

In [15]: %timeit "string not to be found" in string
1 loops, best of 3: 200 ms per loop

I wouldn't say that 335 ms is much time looking for substring in 450MB string.

Upvotes: 19

rplnt
rplnt

Reputation: 2409

I would rely on fast implementation by someone else:

import subprocess
from subprocess import STDOUT
import os

...
with open(os.devnull, 'w') as devnull:
    if subprocess.call('grep %s "%s"' % (smallstring, file), shell=True, stdout=devnull, stderr=STDOUT) == 0:
        pass #do stuff

Won't work on windows.

edit: I'm worried taht grep returns 0 wheter it finds something or not. But I don't have any shell available to me now so I can't test it.

Upvotes: -1

Marcelo Cantos
Marcelo Cantos

Reputation: 185852

How slow is too slow? I just did an a in b test for a unique string at the end of a 170 MB string. It finished before my finger left the Enter key.

Upvotes: 13

ghostdog74
ghostdog74

Reputation: 342373

If you just want to check if that substring exists,

for line in open("file"):
    if substring in line:
         print "exists"
         sys.exit() # or break to do some other stuff

Upvotes: 2

Pedro Loureiro
Pedro Loureiro

Reputation: 11514

You can use one of these algorithms:

Although I believe KMP is more efficient, it's more complicated to implement.The first link includes some pseudo-code that should make it very easy to implement in python.

you can look for alternatives here: http://en.wikipedia.org/wiki/String_searching_algorithm

Upvotes: 5

brandizzi
brandizzi

Reputation: 27050

I do not see how to make it more optimal on the comparison, to be honest. But you can use less memory and lose less time with I/O if you read the file line by line:

has_small_string = False
for line in large_file:
    if small_string in line:
        has_small_string = True
        break
if has_small_string:
   # ... Your stuff here ...

This is no revolutionary improvement, and can be even less useful if you really need the large string in the memory, but it may be helpful, so I am posting here :)

Upvotes: 4

Related Questions