Reputation: 115
Here is my requirement in python:
I literally load a dictionary - say from /usr/share/dict/words (not to be confused with dictionary type) and use it to search for valid words. At the moment I am doing as follows:
dict_list = open('dictionary', 'r').read().split()
def search_dictionary(key):
p=re.compile(key)
# Comment: Is 'key' a prefix for a valid word in dictionary?
# If yes, return True, else return False
tmp_list = [x for x in dict_list if bool(p.match(x))]
if not tmp_list:
....
else:
....
Note that search_dictionary can be called large number of times and this is a bottleneck at the moment. Is there a more efficient way to do this string search? Like say pre-compiling dictionary. One can think of a dictionary attack use case. I am relatively a starter.
EDIT: I have updated the code with comments. As suggested in comments, I am possibly doing more work than needed.
Upvotes: 1
Views: 403
Reputation: 133919
Your algorithm runs in O(n) time with large constants. This seems so wrong, when a simple binary search could do O(lg n). If your regex does not contain special characters, why not:
import bisect
with open('dictionary') as f:
dictionary = f.read().split()
# .sort is slow, so better to sort
# words on disk! And/or run many searches
# for one invocation
dictionary.sort()
def bisect_search(key):
i = bisect.bisect_left(dictionary, key)
if i != len(dictionary):
return dictionary[i].startswith(key)
return False
Sorts the array, and then finds the lexicographically "smallest" word for which word >= key
and sees if it is a prefix for the given key.
Speed test against Padraic's first letter from dictionary, then linear search:
In [1]: %timeit bisect_search('thu')
1000000 loops, best of 3: 1.07 µs per loop
In [2]: %timeit search_dictionary('thu')
1000 loops, best of 3: 595 µs per loop
with 146880 words, and 5760 words starting with t
.
Upvotes: 3
Reputation: 180401
If you want to stop on the first match use any which will short circuit as soon as we find a match, in your code you always go through every word in the dictionary even if you get a match on the first word, you also build a list unnecessarily:
dict_list = open('dictionary').read().split()
def search_dictionary(key):
p = re.compile(key)
if any(p.match(x) for x in dict_list):
.....
You should also preferably only be creating your dictionary once not every time you call the function. Define it at the start of your code and pass it as a parameter if needed.
If you want to find prefixes using str.startswith
would probably be faster:
if any(x.startswith(key) for x in dict_list):
To optimise the startswith calls:
check = str.startswith
if any(check(x, key) for x in dict_list):
Or if it can appear anywhere just use in:
if any(key in x for x in dict_list):
using cpython2.7 using the optimised str.startswith method seems more efficient:
In [15]: s ="efficient"
In [16]: timeit p.match(s)
1000000 loops, best of 3: 359 ns per loop
In [17]: check = str.startswith
In [18]: timeit check(s,"eff")
1000000 loops, best of 3: 212 ns per loop
The difference for non matches is roughly the same
If you make a an actual dict from your dictionary where the keys are from a-z and the values are lists of words starting with the key, you can use the first letter from each key
in your function to do lookups only searching the words that start with the same letter.
from collections import defaultdict
word_dict = defaultdict(list)
with open("/usr/share/dict/words") as f:
for line in f:
line = line.rstrip().lower()
word_dict[line[0]].append(line)
You can see an example output use a key "z":
word_dict["z"]
['z', "z's", 'zachariah', "zachariah's", 'zachary', "zachary's", 'zachery', "zachery's", 'zagreb', "zagreb's", 'zaire', "zaire's", 'zairian', 'zambezi', "zambezi's", 'zambia', "zambia's", 'zambian', "zambian's", 'zambians', 'zamboni', 'zamenhof', "zamenhof's", 'zamora', 'zane', "zane's", 'zanuck', "zanuck's", 'zanzibar', "zanzibar's", 'zapata', 'zaporozhye', 'zapotec', 'zappa', "zappa's", 'zara', "zara's", 'zebedee', 'zechariah', 'zedekiah', "zedekiah's", 'zedong', "zedong's", 'zeffirelli', "zeffirelli's", 'zeke', "zeke's", 'zelig', 'zelma', "zelma's", 'zen', "zen's", 'zenger', "zenger's", 'zeno', "zeno's", 'zens', 'zephaniah', 'zephyrus', 'zeppelin', 'zest', "zest's", 'zeus', "zeus's", 'zhengzhou', 'zhivago', "zhivago's", 'zhukov', 'zibo', "zibo's", 'ziegfeld', 'ziegler', "ziegler's", 'ziggy', "ziggy's", 'zimbabwe', "zimbabwe's", 'zimbabwean', "zimbabwean's", 'zimbabweans', 'zimmerman', "zimmerman's", 'zinfandel', "zinfandel's", 'zion', "zion's", 'zionism', "zionism's", 'zionisms', 'zionist', "zionist's", 'zionists', 'zions', 'ziploc', 'zn', "zn's", 'zoe', "zoe's", 'zola', "zola's", 'zollverein', 'zoloft', 'zomba', "zomba's", 'zorn', 'zoroaster', "zoroaster's", 'zoroastrian', "zoroastrian's", 'zoroastrianism', "zoroastrianism's", 'zoroastrianisms', 'zorro', "zorro's", 'zosma', "zosma's", 'zr', "zr's", 'zsigmondy', 'zubenelgenubi', "zubenelgenubi's", 'zubeneschamali', "zubeneschamali's", 'zukor', "zukor's", 'zulu', "zulu's", 'zulus', 'zuni', 'zwingli', "zwingli's", 'zworykin', 'zyrtec', "zyrtec's", 'zyuganov', "zyuganov's", 'zürich', "zürich's", 'z', 'zanier', 'zanies', 'zaniest', 'zaniness', "zaniness's", 'zany', "zany's", 'zap', "zap's", 'zapped', 'zapping', 'zaps', 'zeal', "zeal's", 'zealot', "zealot's", 'zealots', 'zealous', 'zealously', 'zealousness', "zealousness's", 'zebra', "zebra's", 'zebras', 'zebu', "zebu's", 'zebus', 'zed', "zed's", 'zeds', 'zenith', "zenith's", 'zeniths', 'zephyr', "zephyr's", 'zephyrs', 'zeppelin', "zeppelin's", 'zeppelins', 'zero', "zero's", 'zeroed', 'zeroes', 'zeroing', 'zeros', 'zest', "zest's", 'zestful', 'zestfully', 'zests', 'zeta', 'zigzag', "zigzag's", 'zigzagged', 'zigzagging', 'zigzags', 'zilch', "zilch's", 'zillion', "zillion's", 'zillions', 'zinc', "zinc's", 'zinced', 'zincing', 'zincked', 'zincking', 'zincs', 'zing', "zing's", 'zinged', 'zinger', "zinger's", 'zingers', 'zinging', 'zings', 'zinnia', "zinnia's", 'zinnias', 'zip', "zip's", 'zipped', 'zipper', "zipper's", 'zippered', 'zippering', 'zippers', 'zippier', 'zippiest', 'zipping', 'zippy', 'zips', 'zircon', "zircon's", 'zirconium', "zirconium's", 'zircons', 'zit', "zit's", 'zither', "zither's", 'zithers', 'zits', 'zodiac', "zodiac's", 'zodiacal', 'zodiacs', 'zombi', "zombi's", 'zombie', "zombie's", 'zombies', 'zombis', 'zonal', 'zone', "zone's", 'zoned', 'zones', 'zoning', 'zonked', 'zoo', "zoo's", 'zoological', 'zoologist', "zoologist's", 'zoologists', 'zoology', "zoology's", 'zoom', "zoom's", 'zoomed', 'zooming', 'zooms', 'zoos', 'zucchini', "zucchini's", 'zucchinis', 'zwieback', "zwieback's", 'zygote', "zygote's", 'zygotes']
So fetch the vals use word_dict[key]
to get the appropriate values:
def search_dictionary(key):
check = str.startswith
vals = word_dict[key[0]]
if any(check(x, key) for x in vals):
Not sure if you have considered case or not so you may want to remove the lower call depending on what you want.
Upvotes: 2