Reputation: 86
# read in csv file in form ("case, num, val \n case1, 1, baz\n...")
# convert to form FOO = "casenumval..." (roughly 6 million characters)
for someString in List:  # 60,000 substrings
    if someString not in FOO:
        pass  # do stuff
    else:
        pass  # do other stuff
So my issue is that there are far too many substrings to check against this massive string. I have tried reading the file line by line and checking the substrings against each line, but this still crashes the program. Are there any techniques for efficiently checking a lot of substrings against a very large string?
FOR CONTEXT: I am performing data checks; suspect data is saved to a csv file to be reviewed/changed. This reviewed/changed file is then compared to the original file. Data which has not changed has been verified as good and must be saved to a new "exceptionFile". Data which has been changed and now passes is disregarded. Data which has been changed and is still suspect is sent off for review again.
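For reference, here is a rough sketch of that workflow (the file names, the is_suspect() check, and the assumption that the two files' rows line up one-to-one are hypothetical stand-ins, not my actual code):

import csv

def is_suspect(row):
    # hypothetical stand-in for the real data checks
    return row["val"] == ""

still_suspect = []  # rows to send off for another review pass

with open("original.csv", newline="") as orig_f, \
     open("reviewed.csv", newline="") as rev_f, \
     open("exceptionFile.csv", "w", newline="") as exc_f:
    original = csv.DictReader(orig_f)
    reviewed = csv.DictReader(rev_f)
    writer = csv.DictWriter(exc_f, fieldnames=["case", "num", "val"])
    writer.writeheader()
    for old, new in zip(original, reviewed):
        if old == new:
            writer.writerow(new)       # unchanged -> verified good, save to exceptionFile
        elif is_suspect(new):
            still_suspect.append(new)  # changed but still suspect -> review again
        # changed and now passing -> disregard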
Upvotes: 1
Views: 171
Reputation: 140886
The first thing you should do is convert your list of 60,000 strings to search for into one big regular expression:
import re
searcher = re.compile("|".join(re.escape(s) for s in List))
Now you can search for them all at once:
for m in searcher.finditer(FOO):
    print(m.group(0))  # prints the substring that matched
If all you care about is knowing which ones were found,
print(set(m.group(0) for m in searcher.finditer(FOO)))
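For example, with toy stand-ins for List and FOO, you can recover your original if/else branches from a single scan instead of 60,000 separate `in` checks:

import re

List = ["case11baz", "case27qux", "case33foo"]  # stand-in for the 60,000 substrings
FOO = "case11bazcase22barcase33foo"             # stand-in for the 6-million-char string

searcher = re.compile("|".join(re.escape(s) for s in List))
found = set(m.group(0) for m in searcher.finditer(FOO))

for someString in List:
    if someString not in found:
        pass  # do stuff (not present in FOO)
    else:
        pass  # do other stuff (present in FOO)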
This is still doing substantially more work than the absolute minimum, but it should be much more efficient than what you were doing before.
Also, if you know that your input is a CSV file and you also know that none of the strings-to-search-for contain a newline, you can operate line by line, which may or may not be faster than what you were doing depending on conditions, but will certainly use less memory:
with open("foo.csv") as FOO:
    for line in FOO:
        for m in searcher.finditer(line):
            pass  # do something with the substring that matched
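If you go that route and still need to know which of your strings were never found, you can accumulate the matches across lines and check the set afterwards (this reuses List from your question, and assumes, as above, that no search string spans a newline):

import re

searcher = re.compile("|".join(re.escape(s) for s in List))

found = set()
with open("foo.csv") as FOO:
    for line in FOO:
        found.update(m.group(0) for m in searcher.finditer(line))

# substrings from List that never appeared anywhere in the file
missing = [s for s in List if s not in found]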
Upvotes: 2