Reputation: 1621
I have a large text file (parsed.txt) that contains almost 1,500,000 lines. Each line is in this format:
foobar foo[Noun]+lAr[A3pl]+[Pnon]+[Nom]
loremipsum lorem[A1sg]+lAr[A3pl]+[Pl]+[Nom]
I pass in the second field (the part after the space) and get back the first field (the part before the space) with this function:
def find_postag(word, postag):
    with open('parsed.txt', "r") as zemberek:
        for line in zemberek:
            if all(i in line for i in (word, postag)):
                if line.split(" ")[0].startswith(word) and line.split(" ")[1] == word + postag:
                    selectedword = line.split(" ")[0]
                    break
    return selectedword
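For example, given the sample lines above, find_postag("foo", "[Noun]+lAr[A3pl]+[Pnon]+[Nom]") is meant to return "foobar" from the first line.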
However, it works too slowly, and I'm not sure how to make the process faster. My idea is this: the parsed.txt file is sorted alphabetically, so if the given word starts with the letter "z", the function reads almost 900,000 lines unnecessarily. Maybe it would be faster if it started checking from around line 900,000 whenever the given word starts with "z".
Upvotes: 0
Views: 74
Reputation: 1860
Since your input file is sorted alphabetically, you could build a dictionary that records the line number where each letter starts, like this:
with open('parsed.txt', 'r') as f:
    data = [line.strip() for line in f if line.strip()]

index = dict()
for i, line in enumerate(data):
    first_letter = line[0].lower()
    if first_letter not in index:
        index[first_letter] = i
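For the two sample lines in the question, for instance, index would come out as {'f': 0, 'l': 1}.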
Add that code at the beginning so it runs only once, before you start doing the searches. This way, when you search for a word, the scan can start where the word's first letter begins, like this:
def find_postag(word, postag):
    start = index.get(word[0].lower(), 0)  # fall back to a full scan for an unseen letter
    selectedword = None  # returned unchanged if nothing matches
    for line in data[start:]:
        # your code here
        if all(i in line for i in (word, postag)):
            if line.split(" ")[0].startswith(word) and line.split(" ")[1] == word + postag:
                selectedword = line.split(" ")[0]
                break
    return selectedword
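With the sample data, find_postag("foo", "[Noun]+lAr[A3pl]+[Pnon]+[Nom]") would then return "foobar". And since data is sorted, you could even binary-search straight to the word itself with the standard bisect module instead of scanning from the letter boundary; a minimal sketch of that variant:

import bisect

def find_postag(word, postag):
    # Binary search for the first line that could begin with `word`;
    # this relies on `data` being sorted and stripped as above.
    start = bisect.bisect_left(data, word)
    for line in data[start:]:
        first, _, second = line.partition(" ")
        if not first.startswith(word):
            return None  # past every line sharing the prefix, so no match
        if second == word + postag:
            return first
    return None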
Upvotes: 1