Reputation: 21
I am trying to write an algorithm in Python that finds a phrase whose words may sit on different lines of a big text file.
The file contents are as follows:
fkerghiohgeoihhgergerig ooetbjoptj
enbotobjeob hi how
are you lerjgoegjepogjejgpgrg]
ekrngeigoieghetghehtigehtgiethg
ieogetigheihietipgietigeitgegitie
.......
The algorithm should search for the phrase "hi how are you" and return True in this case. Since the file can be huge, its contents cannot all be read into memory at once.
Upvotes: 1
Views: 374
Reputation: 77337
You can read the file one character at a time and change line feeds to spaces. Then it's just a question of running down the list of wanted characters.
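def find_words(text, fileobj):
    i = 0  # index of the next character of text we expect
    while True:
        c = fileobj.read(1)
        if not c:  # end of file
            break
        if c == "\n":  # python combines \r\n
            c = " "
        if c != text[i]:
            # Mismatch: start over. This simple restart can miss a match
            # that overlaps a failed attempt (e.g. "hi hi how are you").
            i = 0
        if c == text[i]:
            i += 1
            if i == len(text):
                return True
    return False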
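The simple restart in the loop above drops the partial match entirely on a mismatch, so an input such as "hi hi how are you" is reported as not found. One way around that, sketched here as a swapped-in technique rather than the answer's own method, is a Knuth-Morris-Pratt failure table, which still reads a single character at a time. The names `build_failure_table` and `find_words_kmp` are made up for this illustration.

```python
import io


def build_failure_table(pattern):
    """For each prefix of pattern, the length of its longest proper
    prefix that is also a suffix (the KMP failure table)."""
    table = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = table[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        table[j] = k
    return table


def find_words_kmp(text, fileobj):
    """Stream fileobj one character at a time, treating newlines as spaces."""
    if not text:
        return True
    table = build_failure_table(text)
    i = 0  # number of characters of text matched so far
    while True:
        c = fileobj.read(1)
        if not c:  # end of file
            return False
        if c == "\n":
            c = " "
        # On a mismatch, fall back to the longest shorter partial match
        # instead of restarting from zero.
        while i > 0 and c != text[i]:
            i = table[i - 1]
        if c == text[i]:
            i += 1
        if i == len(text):
            return True


# Usage with an in-memory stand-in for the large file.
sample = io.StringIO("enbotobjeob hi hi how\nare you lerjgoe\n")
print(find_words_kmp("hi how are you", sample))
```

Memory use stays at one character plus a table the size of the phrase, so the file size does not matter.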
If you want to be a little more liberal about whitespace and case sensitivity, you could remove all whitespace and lower case everything before the compare.
import re
import itertools
from string import whitespace

def find_words(text, fileobj):
    # Lowercase the phrase and flatten it into a list of non-whitespace characters.
    chars = list(itertools.chain.from_iterable(re.split(r"\s+", text.lower())))
    i = 0
    while True:
        c = fileobj.read(1)
        if not c:  # end of file
            break
        c = c.lower()
        if c in whitespace:  # skip all whitespace in the file
            continue
        if c != chars[i]:
            i = 0
        if c == chars[i]:
            i += 1
            if i == len(chars):
                return True
    return False
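Stripping all whitespace also erases word boundaries, so "hih oware you" would match as well. If that is not wanted, a word-based variant is one option: a minimal sketch, assuming words are separated by whitespace and that each individual line fits in memory. The function name `find_phrase_by_words` is invented for this example.

```python
import io
from collections import deque


def find_phrase_by_words(phrase, fileobj):
    """Case-insensitive search for the words of phrase appearing
    consecutively, with any run of whitespace (including newlines)
    between them."""
    target = phrase.lower().split()
    if not target:
        return True
    # Sliding window holding the most recent len(target) words.
    window = deque(maxlen=len(target))
    for line in fileobj:  # reads one line at a time
        for word in line.lower().split():
            window.append(word)
            if len(window) == len(target) and list(window) == target:
                return True
    return False


# Usage with an in-memory stand-in for the large file.
sample = io.StringIO(
    "fkerghiohgeoihhgergerig ooetbjoptj\n"
    "enbotobjeob hi how\n"
    "are you lerjgoegjepogjejgpgrg]\n"
)
print(find_phrase_by_words("hi how are you", sample))
```

Because it compares whole words, punctuation attached to a word (for example "you,") would prevent a match unless it is stripped first.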
Upvotes: 1
Reputation: 71689
Here is one way to solve the problem:
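import re

def find_phrase():
    phrase = "hi how are you"
    # Track which words of the phrase have been seen so far.
    # Note: this only checks that every word occurs somewhere in the file,
    # not that the words are adjacent or in order.
    words = dict.fromkeys(phrase.split(), False)
    with open("data.txt", "r") as f:
        for line in f:
            for word in words:
                # re.escape guards against regex metacharacters in a word.
                if re.search(r"\b" + re.escape(word) + r"\b", line):
                    words[word] = True
            if all(words.values()):
                return True
    return False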
EDIT:
def find_phrase():
    phrase = "hi how are you"
    with open("data.txt", "r") as f:
        for line in f:
            # Only matches when the whole phrase sits on a single line.
            if phrase in line:
                return True
    return False
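A per-line `in` test cannot see a phrase that continues on the next line. One possible adjustment, sketched here under the assumption that a newline should count as a single space, is to carry the last `len(phrase) - 1` characters of the text seen so far over into the next line. The function name `find_phrase_across_lines` and the file-object parameter are hypothetical, chosen for this example.

```python
import io


def find_phrase_across_lines(phrase, fileobj):
    """Line-by-line substring search that also finds a phrase
    split across line breaks (newline treated as one space)."""
    if not phrase:
        return True
    keep = len(phrase) - 1  # longest partial match worth carrying over
    tail = ""
    for line in fileobj:
        buffer = tail + line.replace("\n", " ")
        if phrase in buffer:
            return True
        # Keep only the end of the buffer; a match must start there
        # if it is going to continue into the next line.
        tail = buffer[-keep:] if keep else ""
    return False


# Usage with an in-memory stand-in for data.txt.
sample = io.StringIO("enbotobjeob hi how\nare you lerjgoegjepogjejgpgrg]\n")
print(find_phrase_across_lines("hi how are you", sample))
```

Memory use is bounded by the longest line plus the length of the phrase, so the file is never loaded whole.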
Upvotes: 1
Reputation: 7744
If it is a "pretty large" file, access the lines sequentially and don't read the whole file into memory:
with open('largeFile', 'r') as inF:
    for line in inF:
        if 'myString' in line:
            # do_something
            break
Edit:
Since the words of the string can be on consecutive lines, you would want to use a counter to keep track of how many words have been matched so far. For example,
counter = 0
words_list = ["hi", "hello", "how"]
with open('largeFile', 'r') as inF:
    for line in inF:
        # print(words_list[counter], line)
        # The next expected word must be the only word on this line.
        if words_list[counter] in line and len(line.split()) == 1:
            counter += 1
        else:
            counter = 0
        if counter == len(words_list):
            print("here")
            break
Text File
fkerghiohgeoihhgergerig ooetbjoptj enbotobjeob
hi
hello
how
goegjepogjejgpgrg] ekrngeigoieghetghehtigehtgiethg ieoge
It gives the output "here" since the consecutive words are found.
Upvotes: 0