Anurag

Reputation: 482

Check if a substring of a string is in a list of strings in Python

I have a dictionary of foods:

foods={
  "chicken masala" : "curry",
  "chicken burger" : "burger",
  "beef burger" : "burger",
  "chicken soup" : "appetizer",
  "vegetable" : "curry"
}

Now I have a list of strings:

queries = ["best burger", "something else"]

I need to find out whether any string in queries has an entry in the foods dictionary. In the example above, it should return True for "best burger". Currently, I compute the cosine similarity between each string in the list and every key in foods.keys(). It works, but it is very slow. The foods dictionary has almost 1000 entries. Is there a more efficient way to do this?

Edit:

Here "best burger" should be returned because it contains burger, and burger also appears in chicken burger in foods.keys(). I am basically trying to find out whether any query is a food type.

This is how I am calculating it:

import re, math
from collections import Counter

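WORD = re.compile(r'\w+')

def text_to_vector(text):
     # Assumed helper (missing from the original post): word-count vector
     return Counter(WORD.findall(text))

def get_cosine(text1, text2):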
     vec1 = text_to_vector(text1.lower())
     vec2 = text_to_vector(text2.lower())
     intersection = set(vec1.keys()) & set(vec2.keys())
     numerator = sum([vec1[x] * vec2[x] for x in intersection])

     sum1 = sum([vec1[x]**2 for x in vec1.keys()])
     sum2 = sum([vec2[x]**2 for x in vec2.keys()])
     denominator = math.sqrt(sum1) * math.sqrt(sum2)

     if not denominator:
        return 0.0
     else:
        return (float(numerator) / denominator) * 100

foods={
  "chicken masala" : "curry",
  "chicken burger" : "burger",
  "beef burger" : "burger",
  "chicken soup" : "appetizer",
  "vegetable" : "curry"
}
queries = ["best burger", "something else"]
flag = False
food = []
for phrase in queries:
   for k in foods.keys():
      cosine = get_cosine(phrase, k)
      if int(cosine) > 40:
         flag = True
         food.append(phrase)
         break

print('Foods:', food)

OUTPUT:

Foods: ['best burger']

Solution: Although @Black Thunder's solution works for the example I provided, it doesn't work for queries like best burgers, which is a major concern for me. @Andrej Kesely's solution does work in that case; thanks. That was the reason I went for cosine similarity in my own solution, but I think SequenceMatcher works better here.

Upvotes: 0

Views: 276

Answers (4)

Andrej Kesely

Reputation: 195573

You can use difflib to find similar strings (the ratio threshold will probably need some tweaking):

foods={
  "chicken masala" : "curry",
  "chicken burger" : "burger",
  "beef burger" : "burger",
  "chicken soup" : "appetizer",
  "vegetable" : "curry"
}

queries = ["best burger", "order"]

from difflib import SequenceMatcher

out = []
for q in queries:
    for k in foods:
        r = SequenceMatcher(None, k, q).ratio()
        print('q={: <20} k={: <20} ratio={}'.format(q, k, r))
        if r > 0.5:
            out.append(k)

print(out)

Prints:

q=best burger          k=chicken masala       ratio=0.16
q=best burger          k=chicken burger       ratio=0.64
q=best burger          k=beef burger          ratio=0.8181818181818182
q=best burger          k=chicken soup         ratio=0.2608695652173913
q=best burger          k=vegetable            ratio=0.3
q=order                k=chicken masala       ratio=0.10526315789473684
q=order                k=chicken burger       ratio=0.3157894736842105
q=order                k=beef burger          ratio=0.375
q=order                k=chicken soup         ratio=0.11764705882352941
q=order                k=vegetable            ratio=0.14285714285714285
['chicken burger', 'beef burger']
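
For ~1000 keys, one possible variation is to compare individual words rather than whole phrases, using difflib.get_close_matches from the standard library. This is only a sketch: the helper name is_food_query and the 0.8 cutoff are assumptions for illustration, not part of the answer above, and the cutoff would need tuning on real data.

```python
from difflib import get_close_matches

foods = {
    "chicken masala": "curry",
    "chicken burger": "burger",
    "beef burger": "burger",
    "chicken soup": "appetizer",
    "vegetable": "curry",
}

# Collect every distinct word used in the food names once, up front.
food_words = sorted({word for name in foods for word in name.lower().split()})

def is_food_query(query, cutoff=0.8):
    """Hypothetical helper: True if any query word closely matches a food word."""
    return any(
        get_close_matches(word, food_words, n=1, cutoff=cutoff)
        for word in query.lower().split()
    )

print(is_food_query("best burgers"))    # True  ("burgers" is close to "burger")
print(is_food_query("something else"))  # False
```

Because the vocabulary of distinct words is usually much smaller than the number of keys, each query word is compared against fewer candidates than in the phrase-by-phrase loop.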

Upvotes: 1

Nouman

Reputation: 7313

Try this code:

queries = ["best burger", "order"]
foods={
  "chicken masala" : "curry",
  "chicken burger" : "burger",
  "beef burger" : "burger",
  "chicken soup" : "appetizer",
  "vegetable" : "curry"
}
output = []
for y in queries:                 # loop through the queries
    for x in y.split(" "):        # split each query into words
        for z in foods:           # iterate over the keys (same as foods.keys())
            if x in z:            # check whether the word is a substring of the key
                output.append(z)  # if it matches, append the key
print(output)

Output:

['chicken burger', 'beef burger']
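
If the goal is only to know which queries mention a food word, the three nested loops can be avoided by building a set of words from the keys once and intersecting it with each query's words. The following is a minimal sketch of that idea; the names food_words and matching_queries are made up for illustration. Note that it matches whole words exactly, so it would not catch a query like best burgers.

```python
foods = {
    "chicken masala": "curry",
    "chicken burger": "burger",
    "beef burger": "burger",
    "chicken soup": "appetizer",
    "vegetable": "curry",
}
queries = ["best burger", "something else"]

# Build the vocabulary once; set membership tests are O(1) on average.
food_words = {word for name in foods for word in name.lower().split()}

def matching_queries(queries, vocabulary):
    """Hypothetical helper: keep the queries that share a word with the vocabulary."""
    return [q for q in queries if vocabulary.intersection(q.lower().split())]

matched = matching_queries(queries, food_words)
print(matched)        # ['best burger']
print(bool(matched))  # True
```

The work per query then depends on the number of words in the query rather than on the number of keys in foods.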

Upvotes: 1

Kashyap KN

Reputation: 419

You can do something simple like this.

First, get all the keys:

data = foods.keys()

Now join the list of query strings into a single comma-separated string. This makes it much easier to check for substring matches:

queries = ','.join(queries)

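Now check each word of every key for a substring match in the joined queries:

for food in data:
    for item in food.split():    # split each key into words
        if item in queries:      # queries is now the joined string
            print(True)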

Upvotes: 0

Clepsyd

Reputation: 551

If what you want is a list of matches between queries and foods keys, you could use a list comprehension:

matches = [food for food in queries if food in foods]

Upvotes: -1
