JayGatsby

Reputation: 1621

Reading strings from a large file faster

I have a large text file (parsed.txt) with almost 1,500,000 lines. Each line is in this format:

foobar foo[Noun]+lAr[A3pl]+[Pnon]+[Nom]
loremipsum lorem[A1sg]+lAr[A3pl]+[Pl]+[Nom]

I pass in the word and the tag string that together make up the second field (after the space), and I want to get back the first field (before the space). I do it with this function:

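def find_postag(word, postag):
    selectedword = None  # returned as-is when no line matches
    with open('parsed.txt', "r") as zemberek:
        for line in zemberek:
            # cheap substring test before splitting the line
            if word in line and postag in line:
                fields = line.split()  # also drops the trailing newline
                if len(fields) > 1 and fields[0].startswith(word) and fields[1] == word + postag:
                    selectedword = fields[0]
                    break
    return selectedword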

However, it is too slow, and I'm not sure how to speed it up. My idea: parsed.txt is sorted alphabetically, so if the given word starts with "z", the function reads almost 900,000 lines unnecessarily. It might be faster to start checking from around line 900,000 when the word starts with "z". Are there any better ideas, and how can I implement them?

Upvotes: 0

Views: 74

Answers (1)

vesche

Reputation: 1860

Since your input file is sorted alphabetically, you could read it into memory once and build a dictionary that maps each first letter to the line number where that letter starts, like this:

with open('parsed.txt', 'r') as f:
    data = [line.strip() for line in f if line.strip()]

index = {}
for i, line in enumerate(data):
    first_letter = line[0].lower()
    if first_letter not in index:
        index[first_letter] = i  # first line that starts with this letter

Put that code at the top of your script so it runs only once, before you start searching. Then each search can begin at the line where the word's first letter starts, like this:

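def find_postag(word, postag):
    selectedword = None  # returned as-is when no line matches
    # an unseen first letter means no line can match, so the loop is skipped
    start = index.get(word[0].lower(), len(data))
    for line in data[start:]:
        if word in line and postag in line:
            fields = line.split(" ")
            if len(fields) > 1 and fields[0].startswith(word) and fields[1] == word + postag:
                selectedword = fields[0]
                break
    return selectedword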
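
If you run many searches, another option (a different technique from the first-letter index above, so treat it as a sketch rather than a drop-in replacement) is to key a dictionary on the whole second field. Each lookup then becomes a single hash lookup instead of a scan. The names build_lookup and lookup are made up for this illustration, and the sketch assumes every line has exactly two whitespace-separated fields, as in your sample:

```python
from collections import defaultdict


def build_lookup(lines):
    """Map each second field to the first fields that appear with it, in file order."""
    lookup = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # assumption: skip blank or malformed lines
        first_field, second_field = parts
        lookup[second_field].append(first_field)
    return lookup


def find_postag(lookup, word, postag):
    """Return the first matching first field, or None if there is no match."""
    for first_field in lookup.get(word + postag, ()):
        if first_field.startswith(word):
            return first_field
    return None


# Sample lines from the question; in practice pass the open file object instead.
sample = [
    "foobar foo[Noun]+lAr[A3pl]+[Pnon]+[Nom]",
    "loremipsum lorem[A1sg]+lAr[A3pl]+[Pl]+[Nom]",
]
lookup = build_lookup(sample)
result = find_postag(lookup, "foo", "[Noun]+lAr[A3pl]+[Pnon]+[Nom]")
```

With the real file you would build it once with something like `with open('parsed.txt') as f: lookup = build_lookup(f)`. The trade-off is memory: the dictionary holds every line of the file, which is usually acceptable for about 1,500,000 short lines but worth checking on your machine.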

Upvotes: 1
