Hurt Cobain

Reputation: 1

Comparing words from file against a list much too slow

I am looking to write a function that takes a list of words (wordlist), opens a txt file and returns a list of words that don't appear in the txt file. This is what I have so far...

def check_words_in_file(wordlist):
    """Return a list of words that don't appear in words.txt"""
    words = set()
    words = open("words.txt").read().splitlines()

    return [x for x in wordlist if x not in words]

The problem I am having with this function is that it is too slow. If I use a wordlist of, say, 10,000 words, it takes about 15 seconds to complete. If I use one with 300,000 words, it takes far longer than it should. Is there any way I can make this function faster?

Upvotes: 0

Views: 79

Answers (1)

Abhijit

Reputation: 63707

The problem lies in how variables are bound to objects, which is evident when you write

words = set()
words = open("words.txt").read().splitlines()

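In the first line, you create an empty set object and bind the variable words to it. In the second line, you open the file and split its content into lines, which returns a list, and then rebind words to that list. The empty set is discarded, so every membership test in the comprehension scans a list, which takes time proportional to the number of lines in the file.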
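A minimal sketch of that rebinding, using an in-memory string as a stand-in for the contents of words.txt (the sample words are made up for illustration):

```python
words = set()
print(type(words).__name__)    # set

# Hypothetical stand-in for open("words.txt").read().splitlines()
file_lines = "apple\nbanana\ncherry".splitlines()

words = file_lines             # rebinds the name; the empty set is simply dropped
print(type(words).__name__)    # list
```

The first assignment has no lasting effect: the name words ends up referring to a plain list.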

You probably intended to write

words = set(open("words.txt").read().splitlines())
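To get a feel for the difference, here is a rough timing sketch. The data is synthetic and the helper names missing_with_list and missing_with_set are hypothetical, so treat the numbers as indicative only; they will vary by machine.

```python
import timeit

# Synthetic data standing in for words.txt and wordlist (assumption)
file_words = [f"word{i}" for i in range(5000)]
wordlist = [f"word{i}" for i in range(4500, 5500)]

def missing_with_list(wordlist, words):
    # Each "not in" scans the list: time proportional to len(words) per lookup
    return [x for x in wordlist if x not in words]

def missing_with_set(wordlist, words):
    lookup = set(words)  # built once; lookups are constant time on average
    return [x for x in wordlist if x not in lookup]

list_time = timeit.timeit(lambda: missing_with_list(wordlist, file_words), number=1)
set_time = timeit.timeit(lambda: missing_with_set(wordlist, file_words), number=1)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Both helpers return the same result; only the cost of each membership test changes.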

Further improvement

You can do even better if you build a set from the argument wordlist and take its asymmetric set difference with the lines of the file

words = set(wordlist).difference(open("words.txt").read().splitlines())
return list(words)
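One caveat: a set difference does not preserve the order of wordlist and collapses duplicates. If either matters, a sketch along these lines keeps both while still using a set for lookups. The function name missing_words_ordered and its file_lines parameter are hypothetical, chosen so the example runs without a real file.

```python
def missing_words_ordered(wordlist, file_lines):
    """Return words from wordlist that are absent from file_lines, in original order."""
    # Strip newlines and surrounding whitespace once, then look up in a set
    known = {line.strip() for line in file_lines}
    return [w for w in wordlist if w not in known]

sample_lines = ["apple\n", "banana\n"]  # stand-in for an open file object
print(missing_words_ordered(["kiwi", "apple", "fig", "kiwi"], sample_lines))
# ['kiwi', 'fig', 'kiwi']
```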

Nitpick

It is generally not advised to open a file and let the file handle be garbage collected. Either close the file or use a context manager

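def check_words_in_file(wordlist):
    """Return a list of words that don't appear in words.txt"""
    with open("words.txt") as fin:
        # map is lazy in Python 3; on Python 2 use itertools.imap instead
        words = set(wordlist).difference(map(str.strip, fin))
    return list(words)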

Upvotes: 7
