user2966197
user2966197

Reputation: 2981

Best way to match whether a sub-string is present in the keys of a python dictionary

I have a Python dictionary, a sample structure of which is as below (excerpt):

items = {
    "Google": "Mountain View",
    "Johnson & Johnson": "New Brunswick",
    "Apple": "Cupertino",
}

Now what I have is a string, namely str1. What I want to do is look if any of the keys from dictionary items is present in string str1, for example if I have a string like Where is Google based out of?. Initially I wrote this pseudo code:

for str_word in str1.split():
    if str_word in items:
       print("Key found. Value is = ".format(items[str_word]))

Now this is good as dictionary keys are indexed/hashed. So the in operator runtime is constant but as you can notice this works fine for words like Google or Apple but this will not work for Johnson & Johnson (if my string is Where is Jonhnson & Johnson based out of?).

The other way I can think of is to first extract all the keys from the dictionary and then iterate one-by-one over each key and see if it is present in the str1 (reverse of first approach). This will increase the runtime as my dictionary is huge with hundreds or thousands of keys.

I want to know if there is a way I can modify my first approach to count for being able to match a sub-string with keys of a dictionary that could contain multiple words like Johnson & Johnson?

Upvotes: 4

Views: 448

Answers (4)

for multi-pattern matching in dictionary, recommended solution is Aho-Corasick , the aho-corasick algorithm used for static and dynamic pattern matching

also, you could use this solution for dynamic pattern matching by suffix tree

Upvotes: 0

abc
abc

Reputation: 11929

An approach may be the following:

items = {
        "Google":"Mountain View",
        "Johnson & Johnson": "New Brunswick",
        "Apple": "Cupertino"
}

words = "Where is Johnson & Johnson based out of?".rstrip("?").split()

#find the max number of words in a key
len_max = max([len(s.split()) for s in items])

#split the sentence in consecutive words according to the maximum number of words in a key, i.e., in consecutive groups of size 1,2,...,len_max_words_in_key
to_check = [" ".join(words[i:i+j]) for j in range(1,len_max + 1) for i in range(0,len(words)+1-j)]


#check for the key
for el in to_check:
     if el in items:
        print("Key found. Value is = {}".format(items[el]))

As sentences are short, the number of checks that need to be done is small.
For instance, for a sentence made by 20 words and a key made by a maximum of 5 words you have 90 = (20 + 19 + 18 + 17 + 16) checks in the dictionary to be done.

Upvotes: 0

Nisba
Nisba

Reputation: 3448

If your dictionary does not change, while your input string does (the one in which you want to find the keys as substring), one of the fastest approaches would be to use the Aho-Corasick algorithm.

The first step of the algorithm preprocesses the strings in your dictionary and this is done only once, independently of the input string, in O(m) time and space, where m is the sum of the lengths of the keys in the dictionary.

Then, the algorithm will find all the occurrences in an input string in O(n + m + k), where n is the length of the input string and k is the total number of occurrences of any key as a substring of the input string.

You can search for a Python implementation of the Aho-Corasick algorithm so that you will have only to integrate that into your code, without rewriting it.

Upvotes: 3

cap
cap

Reputation: 377

If your string is always the same/has a structure you could use regex to match the key you are looking for.

import re

string = 'Where is Johnson and Johnson based out of?'
match = re.search(r'Where is (.*) based out of?',string)
key = match.group(1)

Of course, you should modify this to match what you need.

Otherwise, I think I would go with your approach of iterating over the dict keys to see if they match with your string. Splitting str1 might give you trouble if you have more than one key with, for example &.

Upvotes: 0

Related Questions