Reputation: 2981
I have a Python dictionary, a sample structure of which is as below (excerpt):
items = {
"Google": "Mountain View",
"Johnson & Johnson": "New Brunswick",
"Apple": "Cupertino",
}
Now what I have is a string, namely str1
. What I want to do is look if any of the keys from dictionary items
is present in string str1
, for example if I have a string like Where is Google based out of?
. Initially I wrote this pseudo code:
for str_word in str1.split():
if str_word in items:
print("Key found. Value is = ".format(items[str_word]))
Now this is good as dictionary keys are indexed/hashed. So the in
operator runtime is constant but as you can notice this works fine for words like Google
or Apple
but this will not work for Johnson & Johnson
(if my string is Where is Jonhnson & Johnson based out of?
).
The other way I can think of is to first extract all the keys from the dictionary and then iterate one-by-one over each key and see if it is present in the str1
(reverse of first approach). This will increase the runtime as my dictionary is huge with hundreds or thousands of keys.
I want to know if there is a way I can modify my first approach to count for being able to match a sub-string with keys of a dictionary that could contain multiple words like Johnson & Johnson
?
Upvotes: 4
Views: 448
Reputation: 1
for multi-pattern matching in dictionary, recommended solution is Aho-Corasick , the aho-corasick algorithm used for static and dynamic pattern matching
also, you could use this solution for dynamic pattern matching by suffix tree
Upvotes: 0
Reputation: 11929
An approach may be the following:
items = {
"Google":"Mountain View",
"Johnson & Johnson": "New Brunswick",
"Apple": "Cupertino"
}
words = "Where is Johnson & Johnson based out of?".rstrip("?").split()
#find the max number of words in a key
len_max = max([len(s.split()) for s in items])
#split the sentence in consecutive words according to the maximum number of words in a key, i.e., in consecutive groups of size 1,2,...,len_max_words_in_key
to_check = [" ".join(words[i:i+j]) for j in range(1,len_max + 1) for i in range(0,len(words)+1-j)]
#check for the key
for el in to_check:
if el in items:
print("Key found. Value is = {}".format(items[el]))
As sentences are short, the number of checks that need to be done is small.
For instance, for a sentence made by 20
words and a key made by a maximum of 5
words you have 90 = (20 + 19 + 18 + 17 + 16)
checks in the dictionary to be done.
Upvotes: 0
Reputation: 3448
If your dictionary does not change, while your input string does (the one in which you want to find the keys as substring), one of the fastest approaches would be to use the Aho-Corasick algorithm.
The first step of the algorithm preprocesses the strings in your dictionary and this is done only once, independently of the input string, in O(m)
time and space, where m
is the sum of the lengths of the keys in the dictionary.
Then, the algorithm will find all the occurrences in an input string in O(n + m + k)
, where
n
is the length of the input string and k
is the total number of occurrences of any key as a substring of the input string.
You can search for a Python implementation of the Aho-Corasick algorithm so that you will have only to integrate that into your code, without rewriting it.
Upvotes: 3
Reputation: 377
If your string is always the same/has a structure you could use regex to match the key you are looking for.
import re
string = 'Where is Johnson and Johnson based out of?'
match = re.search(r'Where is (.*) based out of?',string)
key = match.group(1)
Of course, you should modify this to match what you need.
Otherwise, I think I would go with your approach of iterating over the dict keys to see if they match with your string. Splitting str1
might give you trouble if you have more than one key with, for example &
.
Upvotes: 0