Reputation: 451
I have the following problem: I get 1-10 tags related to an image, each with a probability of appearing in the image.
inputs: beach, woman, dog, tree ...
I would like to retrieve from a database an already-composed sentence that is most related to the tags.
e.g:
beach -> "fun at the beach" / "chilling on the beach" ....
beach, woman -> "woman at the beach"
beach, woman, dog -> none found!
In that case, take the closest existing match, but consider the probabilities. Say woman = 0.95, beach = 0.85, dog = 0.7: if it exists, take woman+beach (0.95, 0.85), then woman+dog, and last beach+dog. The order is that pairs with higher probabilities come first; we are not summing them.
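A small sketch of the order I mean (the probabilities here are just illustrative):
from itertools import combinations

probs = {'woman': 0.95, 'beach': 0.85, 'dog': 0.7}  # illustrative values

# sort tags by probability, then take pairs in that order:
# pairs with higher-probability tags come first (no summing involved)
tags_by_prob = sorted(probs, key=probs.get, reverse=True)
print(list(combinations(tags_by_prob, 2)))
# [('woman', 'beach'), ('woman', 'dog'), ('beach', 'dog')]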
I thought of using Python sets, but I am not really sure how.
Another option would be a defaultdict:
db['beach']['woman']['dog'], but I want to get the same result from db['woman']['beach']['dog'] as well.
I would like to get a nice solution. Thanks.
EDIT: Working solution
from collections import OrderedDict

sentences = OrderedDict()
sentences[('dogs',)] = ['I like dogs', 'dogs are man best friends!']
sentences[('dogs', 'beach')] = ['the dog is at the beach']
sentences[('woman', 'cafe')] = ['The woman sat at the cafe.']
sentences[('woman', 'beach')] = ['The woman was at the beach']
sentences[('dress',)] = ['hi nice dress', 'what a nice dress !']

def keys_to_list_of_sets(dict_):
    # turn each key tuple into a set so the tag order does not matter
    list_of_keys = []
    for key in dict_:
        list_of_keys.append(set(key))
    return list_of_keys

def match_best_sentence(image_tags):
    # print every key whose tags are all contained in the image tags
    for i, tags in enumerate(list_of_keys):
        if (tags & image_tags) == tags:
            print(list(sentences.keys())[i])

list_of_keys = keys_to_list_of_sets(sentences)
tags = set(['beach', 'dogs', 'woman'])
match_best_sentence(tags)
results:
('dogs',)
('dogs', 'beach')
('woman', 'beach')
This solution runs over all the keys of the ordered dictionary, which is O(n); I would like to see any performance improvement.
Upvotes: 1
Views: 73
Reputation: 2072
The simplest way of doing this without using a DB seems to be to keep a set of sentences for each word and take intersections.
More explicitly:
If a sentence contains the word "woman" then you put it into the "woman" set, and similarly for "dog", "beach", etc. This means your space complexity is O(sentences * average_tags), as each sentence is repeated in the data structure once per tag.
You may have:
>>> dogs = set(["I like dogs", "the dog is at the beach"])
>>> woman = set(["The woman sat at the cafe.", "The woman was at the beach"])
>>> beach = set(["the dog is at the beach", "The woman was at the beach", "I do not like the beach"])
>>> dogs.intersection(beach)
{'the dog is at the beach'}
You can build this into an object on top of a defaultdict, so that given a list of tags you intersect only the corresponding sets and return the results.
Rough implementation idea:
from collections import defaultdict

class myObj(object): #python2
    def __init__(self):
        self.sets = defaultdict(set)

    def add_sentence(self, sentence, tags):
        # how you process the tags is up to you, they could also be
        # parsed from the input string
        for t in tags:
            self.sets[t].add(sentence)

    def get_match(self, tags):
        # copy the first tag's set, then narrow it down by intersecting
        # with the set of every remaining tag
        result = set(self.sets[tags[0]])
        for t in tags[1:]:
            result = result.intersection(self.sets[t])
        return result # this function can stand to be improved but the idea is there
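For instance, usage might look like this (sentences and tags made up for illustration):
obj = myObj()
obj.add_sentence('the dog is at the beach', ['dogs', 'beach'])
obj.add_sentence('The woman was at the beach', ['woman', 'beach'])
print(obj.get_match(['dogs', 'beach'])) # set(['the dog is at the beach'])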
Maybe this will make it clearer how the defaultdict and the sets end up looking inside the object.
>>> a = defaultdict(lambda: set())
>>> a['woman']
set([])
>>> a['woman'].add(1)
>>> str(a)
"defaultdict(<function <lambda> at 0x7fcb3bbf4b90>, {'woman': set([1])})"
>>> a['beach'].update([1,2,3,4])
>>> a['woman'].intersection(a['beach'])
set([1])
>>> str(a)
"defaultdict(<function <lambda> at 0x7fcb3bbf4b90>, {'woman': set([1]), 'beach': set([1, 2, 3, 4])})"
Upvotes: 1
Reputation: 11939
It mainly depends on the size of the database and on the number of combinations between keywords. It also depends on which operation you perform most often.
If it's small and you need a fast find operation, one possibility is to use a dictionary with frozensets as keys, which contain the tags, and with all the associated sentences as values.
For instance,
from collections import defaultdict

d = defaultdict(list)
# preprocessing
d[frozenset(["bob", "car", "red"])].append("Bob owns a red car")
# searching: frozensets ignore order, so both lookups hit the same key
d[frozenset(["bob", "car", "red"])] #['Bob owns a red car']
d[frozenset(["red", "car", "bob"])] #['Bob owns a red car']
For combinations of words like "bob", "car" you have different possibilities, depending on the number of keywords and on what matters more.
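For example, one simple option (a rough sketch; the helper name find_closest is just illustrative) is to store only the full tag sets and, at lookup time, enumerate subsets of the query tags, trying larger subsets first:
from collections import defaultdict
from itertools import combinations

d = defaultdict(list)
d[frozenset(["bob", "car", "red"])].append("Bob owns a red car")

def find_closest(d, tags):
    # try the full tag set first, then smaller and smaller subsets,
    # so the most specific stored match wins
    for size in range(len(tags), 0, -1):
        for combo in combinations(tags, size):
            key = frozenset(combo)
            if key in d: # membership test does not create new entries
                return d[key]
    return []

find_closest(d, ["red", "car", "bob"]) #['Bob owns a red car']
find_closest(d, ["bob", "car"]) #[] no stored key is a subset of the query
With at most 10 tags per image this is at most 2^10 = 1024 membership tests, which is cheap.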
Upvotes: 0