Reputation: 333
I have two lists of words to compare. First list is 2 million words, second list is 150,000 words. What I need to do is to apply binary search to see if words of the first list appear in the second. I was trying liner search:
for word in words_list:
if word in dict_list:
print(word, 1)
else:
print(word, 0)
It works good, but it is very slow. Then I tried binary search but it did not work correctly:
for word in wordlist:
lb = 0
ub = len(dict_list)
mid_index = (lb + ub) // 2
item_at_mid = dict_list[mid_index]
if item_at_mid == word:
print(word)
if item_at_mid < word:
lb = mid_index + 1
else:
ub = mid_index
In the end I need two list first list of words that are in dictionary and second that are not.
Upvotes: 1
Views: 1582
Reputation: 4318
If you are using binary search your input should have been ordered.
One other possibility is convert your words_list
and dict_list
to set
and get the output as follows:
words common to both:
words_list.intersection(dict_list)
words not in one other:
words_list-dict_list
dict_list-words_list
Upvotes: 1
Reputation: 5847
A solution is to prefer using set
instead of list, because of the O(1)
complexity for __contains__
operation, as given in this solution.
If memory is a problem, then using a bloom filter is probably a good tradeoff (no false negative).
Here is a python implementation.
To create and maintain a binary tree, consider using heapq
standard module.
Upvotes: 0
Reputation: 672
If you want to do a Binary Search:
present = []
absent = []
for word in firstList:
lb,ub = 0,len(secondList) - 1
found = False
while lb <= ub:
mid = (lb + ub) // 2
if secondList[mid] == word:
found = True
break
elif secondList[mid] < word:
lb = mid + 1
else:
ub = mid - 1
if found:
present.append(word)
else:
absent.append(word)
Your binary search code was incorrect.
Upvotes: 1
Reputation: 786
You can use sets:
inter = set(words_list).intersection(dict_list)
for word in words_list:
if word in inter:
print(word, 1)
else:
print(word, 0)
Upvotes: 2