Ranjan P

Reputation: 21

How to find a phrase in a large text file in Python?

I am trying to write an algorithm in Python that finds a phrase whose words may be spread over different lines of a big text file.

The file contents are as follows:

fkerghiohgeoihhgergerig ooetbjoptj
enbotobjeob hi how
are you lerjgoegjepogjejgpgrg]
ekrngeigoieghetghehtigehtgiethg
ieogetigheihietipgietigeitgegitie
.......

The algorithm should search for the phrase "hi how are you" and return True in this case. Since the file can be huge, its contents cannot all be read at once.

Upvotes: 1

Views: 374

Answers (3)

tdelaney

Reputation: 77337

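You can read the file one character at a time and change line feeds to spaces. Then it's just a question of running down the list of wanted characters. On a mismatch, fall back to the longest prefix of the phrase that still matches (the Knuth-Morris-Pratt failure table) rather than restarting from zero, otherwise input such as "hi hi how are you" is missed.

def find_words(text, fileobj):
    # Failure table: fail[j] is the length of the longest proper prefix
    # of text[:j + 1] that is also a suffix of it.
    fail = [0] * len(text)
    k = 0
    for j in range(1, len(text)):
        while k and text[j] != text[k]:
            k = fail[k - 1]
        if text[j] == text[k]:
            k += 1
        fail[j] = k
    i = 0  # number of characters of text matched so far
    while True:
        c = fileobj.read(1)
        if not c:
            break  # end of file
        if c == "\n":  # in text mode Python translates \r\n to \n
            c = " "
        # Mismatch: shrink the match to the longest prefix that still fits.
        while i and c != text[i]:
            i = fail[i - 1]
        if c == text[i]:
            i += 1
            if i == len(text):
                return True
    return False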

If you want to be a little more liberal about whitespace and case sensitivity, you could remove all whitespace and lowercase everything before comparing.

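def find_words(text, fileobj):
    # Compare with all whitespace removed and everything lowercased.
    chars = "".join(text.lower().split())
    # Knuth-Morris-Pratt failure table: fail[j] is the length of the longest
    # proper prefix of chars[:j + 1] that is also a suffix of it.
    fail = [0] * len(chars)
    k = 0
    for j in range(1, len(chars)):
        while k and chars[j] != chars[k]:
            k = fail[k - 1]
        if chars[j] == chars[k]:
            k += 1
        fail[j] = k
    i = 0  # number of characters of chars matched so far
    while True:
        c = fileobj.read(1)
        if not c:
            break  # end of file
        if c.isspace():
            continue  # whitespace never takes part in the comparison
        c = c.lower()
        # Mismatch: fall back instead of restarting at 0, so that input
        # such as "hi hi how are you" is still found.
        while i and c != chars[i]:
            i = fail[i - 1]
        if c == chars[i]:
            i += 1
            if i == len(chars):
                return True
    return False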
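
If whitespace between the words should match loosely but the words themselves must stay separated, another option is to memory-map the file and let a regular expression accept any run of whitespace between words. This is a different technique from the character loop, sketched here under some assumptions: a 64-bit system, a file on disk, and ASCII-only case folding (which is what `re.IGNORECASE` gives for bytes patterns). `phrase_in_file` is a hypothetical name.

```python
import mmap
import re

def phrase_in_file(path, phrase):
    # Build a bytes pattern: the words of the phrase separated by any
    # run of whitespace (spaces, tabs, or line breaks).
    words = [re.escape(w.encode("utf-8")) for w in phrase.split()]
    pattern = re.compile(rb"\s+".join(words), re.IGNORECASE)
    with open(path, "rb") as f:
        try:
            # Map the file read-only; the OS pages it in on demand,
            # so the whole file is never copied into a Python string.
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        except ValueError:
            return False  # mmap cannot map an empty file
        with mm:
            return pattern.search(mm) is not None
```

Unlike the whitespace-stripping loop, this version requires at least one whitespace character between consecutive words, so "hihow are you" would not match.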

Upvotes: 1

Shubham Sharma

Reputation: 71689

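Here is one way to approach the problem. It checks that every word of the phrase occurs somewhere in the file, but not that the words appear in order or next to each other:

import re

def find_phrase():
    phrase = "hi how are you"
    # Map each word of the phrase to whether it has been seen yet.
    words = dict.fromkeys(phrase.split(), False)
    with open("data.txt", "r") as f:
        for line in f:
            for word in words:
                # re.escape guards against regex metacharacters in a word.
                if re.search(r"\b" + re.escape(word) + r"\b", line):
                    words[word] = True
            # Stop as soon as every word has been found.
            if all(words.values()):
                return True
    return False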

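EDIT: a simpler version that tests each line for the whole phrase. It only matches when the phrase sits on a single line: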

def find_phrase():
    phrase = "hi how are you"
    with open("data.txt", "r") as f:
        for line in f:
            if phrase in line:
                return True
    return False
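
A per-line substring test misses a phrase that wraps onto the next line, as in the question's sample. One way to keep reading line by line is to carry over the tail of the text seen so far, with line breaks turned into spaces. A minimal sketch under that assumption; `find_phrase_across_lines` and its parameters are hypothetical names, not part of the answer above:

```python
def find_phrase_across_lines(lines, phrase):
    # `lines` is any iterable of text lines, e.g. an open file object.
    keep = len(phrase) - 1  # longest tail that could still begin a match
    tail = ""
    for line in lines:
        # Treat the line break as a single space, then search the
        # carried-over tail together with the new line.
        buf = tail + line.replace("\r", "").replace("\n", " ")
        if phrase in buf:
            return True
        # Keep only the characters that could start a match spanning
        # into the next line.
        tail = buf[-keep:] if keep > 0 else ""
    return False
```

It could be called as `find_phrase_across_lines(open("data.txt"), "hi how are you")`. Memory use stays at roughly one line plus `len(phrase) - 1` characters, assuming the file does have line breaks at reasonable intervals.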

Upvotes: 1

AzyCrw4282

Reputation: 7744

If it is a "pretty large" file, read the lines sequentially rather than loading the whole file into memory:

with open('largeFile', 'r') as inF:
    for line in inF:
        if 'myString' in line:
            # do_something
            break

Edit:

Since the words of the phrase can be on consecutive lines, you can use a counter to keep track of how many words have been matched so far. For example:

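counter = 0  # how many words of words_list have matched so far
words_list = ["hi", "hello", "how"]  # assumed to contain distinct words
with open('largeFile', 'r') as inF:
    for line in inF:
        word = line.strip()  # each word is expected on a line of its own
        if word == words_list[counter]:
            counter += 1
        elif word == words_list[0]:
            counter = 1  # this line may start a new match
        else:
            counter = 0
        if counter == len(words_list):
            print("here")
            break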

Text file:

fkerghiohgeoihhgergerig ooetbjoptj enbotobjeob
hi
hello
how
goegjepogjejgpgrg] ekrngeigoieghetghehtigehtgiethg ieoge

It prints "here" because the words are found on consecutive lines.
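
A variation that does not depend on how the words are laid out on lines is to stream the file word by word and compare a sliding window of the most recent words against the phrase. This is a sketch of a different technique from the counter, assuming words are separated by whitespace and must match exactly (so "you." would not match "you"); `contains_word_sequence` is a hypothetical name:

```python
from collections import deque

def contains_word_sequence(lines, words):
    # `lines` is any iterable of text lines, e.g. an open file object.
    target = list(words)
    if not target:
        return True  # an empty sequence is trivially present
    # The deque keeps only the last len(target) words seen so far.
    window = deque(maxlen=len(target))
    for line in lines:
        for word in line.split():
            window.append(word)
            if list(window) == target:
                return True
    return False
```

With the question's sample it could be called as `contains_word_sequence(open("largeFile"), "hi how are you".split())`; it also handles the one-word-per-line layout shown above.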

Upvotes: 0
