Quick(er) way to check whether a word is english by comparing it against a whitelist of english words?

Question

I am trying to eliminate all non-english words from many (100k) preprocssed text files (porter stemmed and lowercased, dropped all non a-z characters). I already parallelized the process to speed things up, but it is still painfully slow. Is there any more efficient way to do this in python?

englishwords = list(set(nltk.corpus.words.words()))
englishwords = [x.lower() for x in list(englishwords)]
englishwords = [ps.stem(w) for w in englishwords]
# this step takes too long:
shareholderletter= ' '.join(w for w in nltk.wordpunct_tokenize(shareholderletter) if w in englishwords)

Patrick Artner · Accepted Answer

You are checking for somthing in otherthing - and your otherthing is a list.

Lists are good for storing stuff, but lookup of "does x is in" your list takes O(n).

Use a set instead, that drops the lookup to O(1) and it eleminate any dupes so your base -size of things to look up in drops as well if you got duplicates.

If your set does not change afterwards, go and use a frozenset - which is immutable.

Read: Documentation of sets

If you use follow @DeepSpace 's suggestion, and leverage set operations you get even better performance:

s = set( t.lower().strip() for t in ["Some","text","in","set"])

t = set("Some text in a string that holds other words as well".lower().split())

print ( s&t )  # show me all things that are in both sets (aka intersection)

Output:

set(['text', 'some', 'in'])

See set operations

O(n): worstcase: your word is the last of 200k words in your list and you check the whole list - which takes 200k checks.

O(1): lookup time is constant, no matter how many items are in your datastructure, it takes the same amount of time to check if its in. To get this benefit, a set has a more complex storage-solution that takes slightly more memory (then a list) to perform so well on lookups.

Edit: worst case scenario for not finding a word inside a set/list:

import timeit

setupcode = """# list with some dupes
l = [str(i) for i in range(10000)] + [str(i) for i in range(10000)] + [str(i) for i in range(10000)]
# set of this list
s = set( l )
"""

print(timeit.timeit("""k = "10000" in l """,setup = setupcode, number=100))
print(timeit.timeit("""k = "10000" in s """,setup = setupcode, number=100))

0.03919574100000034    # checking 100 times if "10000" is in the list
0.00000512200000457    # checking 100 times if "10000" us in the set

Quick(er) way to check whether a word is english by comparing it against a whitelist of english words?

Answers (1)

Related Questions