Yan
Yan

Reputation: 333

Binary search of words in python list

I have two lists of words to compare. First list is 2 million words, second list is 150,000 words. What I need to do is to apply binary search to see if words of the first list appear in the second. I was trying liner search:

for word in words_list:
    if word in dict_list:
       print(word, 1)
    else:
       print(word, 0)

It works good, but it is very slow. Then I tried binary search but it did not work correctly:

for word in wordlist:
    lb = 0
    ub = len(dict_list)
    mid_index = (lb + ub) // 2
    item_at_mid = dict_list[mid_index]
    if item_at_mid == word:
        print(word)
    if item_at_mid < word:
        lb = mid_index + 1
    else:
        ub = mid_index

In the end I need two list first list of words that are in dictionary and second that are not.

Upvotes: 1

Views: 1582

Answers (4)

venpa
venpa

Reputation: 4318

If you are using binary search your input should have been ordered. One other possibility is convert your words_list and dict_list to set and get the output as follows:

words common to both:

words_list.intersection(dict_list)

words not in one other:

words_list-dict_list
dict_list-words_list

Upvotes: 1

aluriak
aluriak

Reputation: 5847

A solution is to prefer using set instead of list, because of the O(1) complexity for __contains__ operation, as given in this solution.

If memory is a problem, then using a bloom filter is probably a good tradeoff (no false negative).

Here is a python implementation.


To create and maintain a binary tree, consider using heapq standard module.

Upvotes: 0

Prajwal K R
Prajwal K R

Reputation: 672

If you want to do a Binary Search:

present = []
absent = []
for word in firstList:
    lb,ub = 0,len(secondList) - 1
    found = False
    while lb <= ub:
        mid = (lb + ub) // 2
        if secondList[mid] == word:
            found = True
            break
        elif secondList[mid] < word:
            lb = mid + 1
        else:
            ub = mid - 1

    if found:
        present.append(word)
    else:
        absent.append(word)

Your binary search code was incorrect.

Upvotes: 1

Ofer Helman
Ofer Helman

Reputation: 786

You can use sets:

inter = set(words_list).intersection(dict_list)
for word in words_list:
    if word in inter:
        print(word, 1)
    else:
        print(word, 0)

Upvotes: 2

Related Questions