alexjones

Reputation: 86

Finding sub-strings in LARGE string

# read in csv file in form ("case, num, val \n case1, 1, baz\n...")
# convert to form FOO = "casenumval...", roughly 6 million characters
for substr in List:  # 60,000 substrings
    if substr not in FOO:
        pass  # do stuff
    else:
        pass  # do other stuff

So my issue is that there are far too many substrings to check against this massive string. I have tried reading the file in line by line and checking the substrings against each line, but this still crashes the program. Are there any techniques for efficiently checking a lot of substrings against a very large string?

FOR CONTEXT: I am performing data checks; suspect data is saved to a csv file to be reviewed/changed. This reviewed/changed file is then compared to the original file. Data which has not changed has been verified as good and must be saved to a new "exceptionFile". Data that has been changed and passes is disregarded. And data which has been changed, has been checked, and is still suspect is then sent off for review again.

Upvotes: 1

Views: 171

Answers (1)

zwol

Reputation: 140886

The first thing you should do is convert your list of 60,000 strings to search for into one big regular expression:

import re
searcher = re.compile("|".join(re.escape(s) for s in List))

Now you can search for them all at once:

for m in searcher.finditer(FOO):
    print(m.group(0))  # prints the substring that matched

If all you care about is knowing which ones were found,

print(set(m.group(0) for m in searcher.finditer(FOO)))

This is still doing substantially more work than the absolute minimum, but it should be much more efficient than what you were doing before.
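
To get back to the "do stuff" / "do other stuff" split from the question, something like the following might work. It is only a sketch: `split_found_missing` is a name made up for illustration, and the small strings in the usage note stand in for the real `List` and `FOO`. One thing to be aware of is that `finditer` only reports non-overlapping matches, so a substring that overlaps an earlier match can be skipped by the regex; the sketch re-checks the leftovers with a plain `in` test to cover that case.

```python
import re

def split_found_missing(needles, haystack):
    """Return (found, missing) as sets of needles relative to haystack.

    Hypothetical helper, not part of any library.
    """
    needles = set(needles)
    # Put longer alternatives first so a needle is not shadowed by its own prefix.
    ordered = sorted(needles, key=len, reverse=True)
    searcher = re.compile("|".join(re.escape(s) for s in ordered))
    # One pass over the haystack; keep only matches that are real needles
    # (guards against the empty pattern produced by an empty needle list).
    found = {m.group(0) for m in searcher.finditer(haystack)} & needles
    # finditer yields non-overlapping matches only, so anything the regex
    # did not report is re-checked directly before being declared missing.
    found |= {s for s in needles - found if s in haystack}
    return found, needles - found
```

Usage would then look like `found, missing = split_found_missing(List, FOO)`, followed by one loop over `missing` for the "do stuff" branch and one over `found` for the other. If most substrings are present, the fallback `in` checks should be few; if most are absent, this will not save much over the original loop.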

Also, if you know that your input is a CSV file and you also know that none of the strings-to-search-for contain a newline, you can operate line by line, which may or may not be faster than what you were doing depending on conditions, but will certainly use less memory:

with open("foo.csv") as FOO:
    for line in FOO:
        for m in searcher.finditer(line):
            pass  # do something with the substring that matched
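
As a rough illustration of the line-by-line variant, the sketch below records the first line on which the regex reports each substring. `first_line_per_match` is a hypothetical name, and it assumes, as above, that no search string contains a newline. Because matches within a line are non-overlapping, a substring that only ever overlaps another match may not show up in the result.

```python
import re

def first_line_per_match(needles, lines):
    """Map each reported needle to the 1-based number of the first line
    on which the combined regex matched it.

    Hypothetical helper; `lines` can be any iterable of strings,
    including an open file object.
    """
    needles = list(needles)
    if not needles:
        return {}
    searcher = re.compile("|".join(re.escape(s) for s in needles))
    first_seen = {}
    for lineno, line in enumerate(lines, start=1):
        for m in searcher.finditer(line):
            # setdefault keeps the earliest line number for each substring.
            first_seen.setdefault(m.group(0), lineno)
    return first_seen
```

With a real file this could be called as `with open("foo.csv") as f: hits = first_line_per_match(List, f)`, which only ever holds one line of the file in memory at a time.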

Upvotes: 2
