avi

Reputation: 9636

Comparing two large lists in Python

I have one list which contains about 400 words, and another list of lists in which each inner list contains about 150,000 words. There are 20 such lists.

Now I want to see how many of these 400 words appear in each of the 150,000-word lists. I also want to know, for each of the 400 words, how many times it appears in a 150k-word list, which of the words occurs most often, how many times, and so on.

The only solution I can think of is a brute-force nested loop, which will be terribly slow:

counts = {}
for one_list in list_of_150kwords:
    for key in words_400:
        for word in one_list:
            if key == word:
                counts[key] = counts.get(key, 0) + 1  # count this word
                # do other stuff

This is a very ugly and bad solution, but I can't think of anything better. I tried the same thing with NumPy by converting these lists to NumPy arrays:

list_of_150kwords = numpy.array(list_of_150kwords)
...

But I still find it very slow. Any other solution? Or any library?

Upvotes: 17

Views: 10696

Answers (4)

roshan

Reputation: 1333

This is a classic MapReduce problem; see http://sist.sysu.edu.cn/~wuweig/DSCC/Inverted%20Indexing%20by%20MapReduce.pdf for inverted indexing with MapReduce.
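
As a rough in-memory sketch of the same map/reduce shape (the sample data here is purely illustrative):

from collections import Counter
from functools import reduce

list_of_150kwords = [["alpha", "beta"], ["beta", "beta", "gamma"]]  # stand-in data

# Map: produce a partial word count per list.
partials = [Counter(lst) for lst in list_of_150kwords]

# Reduce: merge the partial counts into one global count.
totals = reduce(lambda a, b: a + b, partials, Counter())
print(totals)  # Counter({'beta': 3, 'alpha': 1, 'gamma': 1})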

Upvotes: 0

Usman Ismail

Reputation: 18679

This is a canonical example of where a trie is useful. You need to build a trie for each of your 150K lists; then you can check whether a given word exists in a list in O(W) time, where W is the maximum word length.

Then you can loop through the list of 400 words and check whether each word is in each 150K-word list.

Given that L (the number of 150K lists) is much smaller than 150K, and W is much smaller than 150K, no set join will ever be as fast as a trie lookup.

The final running complexity:

N = 400   # size of the small list
W = 10    # max word length
M = 150K  # max size of each 150K list
P = 20    # number of 150K lists

P * M      # to construct the tries
N * P * W  # to look up each word in all the tries
MP + NPW   # total complexity
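
For illustration, here is a minimal sketch of the trie idea using nested dicts; the helper names and sample data are made up for the example, and note that a trie answers membership, so per-list word frequencies would still need a separate count:

from collections import Counter

_END = object()  # sentinel key marking the end of a complete word

def build_trie(words):
    # Build a nested-dict trie from an iterable of words.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True
    return root

def in_trie(trie, word):
    # Membership test in O(len(word)) time.
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return _END in node

# Stand-in data for the question's lists:
words_400 = ["alpha", "beta", "gamma"]
list_of_150kwords = [["alpha", "delta"], ["beta", "alpha", "alpha"]]

tries = [build_trie(lst) for lst in list_of_150kwords]
# For each of the 400 words, count how many of the lists contain it:
presence = Counter(w for trie in tries for w in words_400 if in_trie(trie, w))
print(presence)  # Counter({'alpha': 2, 'beta': 1})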

Upvotes: 0

Óscar López

Reputation: 236124

This sounds like a good opportunity for using a set:

words_400_set = set(words_400)  # the 400 search words
one_set = set(one_list)         # one of the 150K-word lists

len(one_set & words_400_set) # set intersection is &
=> number of elements common to both sets

By definition, the intersection of two sets contains the elements that appear in both, so it's a simple matter of taking its length. For the second part (which of these words occur most often, how many times, etc.), create a Counter from each 150K list; that will tell you how many times each word appears in the list, and the intersection set will tell you which words are common, covering both of your requirements.
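
A short sketch putting both pieces together (the variable names and sample data are assumed, not from the question):

from collections import Counter

words_400 = ["alpha", "beta", "gamma"]  # stand-in for the 400 search words
list_of_150kwords = [["alpha", "delta", "alpha"], ["beta", "alpha"]]

words_400_set = set(words_400)

for one_list in list_of_150kwords:
    counts = Counter(one_list)            # frequency of every word in this list
    common = words_400_set & set(counts)  # which of the 400 words appear here
    print(len(common), "of the 400 words appear in this list")
    if common:
        top = max(common, key=lambda w: counts[w])
        print("most frequent:", top, "->", counts[top], "times")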

Upvotes: 16

Hugh Bothwell

Reputation: 56684

import operator
from collections import Counter
from functools import reduce  # reduce lives in functools on Python 3

search_data = [
    ["list", "of", "150k", "words"],
    ["another", "list", "of", "150k", "words"],
    ["yet", "another", "list", "of", "150k", "words"]
    # ... 17 more of these
]

search_words = ["four", "hundred", "words", "to", "search", "for"]

def word_finder(words_to_find):
    lookfor = set(word.lower() for word in words_to_find)
    def get_word_count(text):
        return Counter(word for word in (wd.lower() for wd in text) if word in lookfor)
    return get_word_count

def get_words_in_common(counters):
    # Maybe use c.keys() (a set-like view) instead of set(c)? Which is faster?
    return reduce(operator.and_, (set(c) for c in counters))

def main():
    wordcount = word_finder(search_words)
    counters = [wordcount(wordlst) for wordlst in search_data]
    common_to_all = get_words_in_common(counters)
    print(common_to_all)

if __name__=="__main__":
    main()

Upvotes: 1
