aroooo
aroooo

Reputation: 5056

Check word against very large list

If I have a list of 10,000 words, what's an optimized way to check if a word is in that list that won't slow the app down to a crawl?

Should I load the words in from a file and check against that?

def check_for_word(word):
    HUGE_LIST = [...]  # 10,000 Words
    if word in HUGE_LIST:
         return True
    else:
         return False

Upvotes: 0

Views: 267

Answers (4)

gboffi
gboffi

Reputation: 25023

You may want to take a slightly indirect route, writing a function that, given a file containing all your words, returns a function to check membership

def make_checker(fname):
    with open(fname) as f:
        # Hp: one word per line, you can adjust the code for a different format
        # this line builds a set by a _set comprehension_
        words = {word.strip() for word in f}
    def the_checker(word):
        return word in words
    return the_checker

That you can use like this

check_4 = make_checker('corpus_of_4_letter_words.txt')
...
if check_4(answer.strip()):
    print('Please, please do not use these words.')

Upvotes: 0

Łukasz Rogalski
Łukasz Rogalski

Reputation: 23203

Convert a list to set - strings are hashable so set can be easily created.

Look-up in set is O(1), where for list it's O(n), where n is a length of list.

HUGE_SET = set(HUGE_LIST)   # or frozenset, if it's constant and words won't be added to it
return word in HUGE_SET

Also, consider moving creation of huge list and huge set outside of function body. Right now list is recreated every time function is called.

List timings:

$ python -m timeit -s "words = list(map(str, xrange(10000)))" -n 10000 "'5000' in words"
10000 loops, best of 3: 58.2 usec per loop

Frozenset timings:

$ python -m timeit -s "words = frozenset(map(str, xrange(10000)))" -n 10000 "'5000' in words"
10000 loops, best of 3: 0.0504 usec per loop

Upvotes: 4

RemcoGerlich
RemcoGerlich

Reputation: 31250

Read the words in from a file and turn them into a set of words. Checking membership of a set is very fast (and 10,000 is not "very large" :-)).

with open('words.txt') as words:
    wordset = {word.strip() for word in words}

return word in wordset

(though it would help if you didn't have to read it in each time, keep it around in a variable -- building the set each time takes longer than checking if a word is in it in your original way)

Upvotes: 1

Moinuddin Quadri
Moinuddin Quadri

Reputation: 48057

  • If you won't be doing any modification in the list, use tuple instead of list.

  • If the items in the list are unique, it is even better to use set then.

Lookup will be faster with tuple/set as compared to lookup in list

Upvotes: 2

Related Questions