Sion C

Reputation: 451

Suitable data structure for fast search on sets. Input: tags, output: sentence

I have the following problem.

I get 1-10 tags related to an image, each with a probability that it appears in the image.

inputs: beach, woman, dog, tree ...

I would like to retrieve from a database an already-composed sentence that is most related to the tags.

e.g:

beach -> "fun at the beach" / "chilling on the beach" ....

beach, woman -> "woman at the beach"

beach, woman, dog -> none found!

If no exact match exists, take the closest existing combination, but take the probabilities into account. Say woman 0.95, beach 0.85, dog 0.7: first try woman+beach (0.95, 0.85), then woman+dog, and last beach+dog. Tags with higher probability come first, but the probabilities are not summed.
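Something like this ordering is what I have in mind (a rough sketch; the function name and the dict-of-probabilities input are just for illustration):

from itertools import combinations

def candidate_tag_sets(tag_probs):
    # tag_probs maps tag -> probability, e.g. {'woman': 0.95, 'beach': 0.85, 'dog': 0.7}
    ranked = sorted(tag_probs, key=tag_probs.get, reverse=True)
    # largest combinations first; within one size, combinations() keeps the
    # probability order, so woman+beach comes before woman+dog before beach+dog
    for size in range(len(ranked), 0, -1):
        for combo in combinations(ranked, size):
            yield frozenset(combo)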

I thought of using Python sets, but I am not really sure how.

Another option would be a defaultdict:

db['beach']['woman']['dog'], but I want to get the same result from db['woman']['beach']['dog'] as well.
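A frozenset key would make the lookup order-independent; a rough sketch of what I mean:

db = {}
db[frozenset(['beach', 'woman', 'dog'])] = 'some composed sentence'
# frozensets ignore order, so both lookups return the same sentence
db[frozenset(['woman', 'beach', 'dog'])]  # 'some composed sentence'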

I would like to get a nice solution. Thanks.

EDIT: Working solution

from collections import OrderedDict

# map tuples of tags to the sentences that describe them
sentences = OrderedDict()
sentences[('dogs',)] = ['I like dogs', 'dogs are man best friends!']
sentences[('dogs', 'beach')] = ['the dog is at the beach']
sentences[('woman', 'cafe')] = ['The woman sat at the cafe.']
sentences[('woman', 'beach')] = ['The woman was at the beach']
sentences[('dress',)] = ['hi nice dress', 'what a nice dress !']


def keys_to_list_of_sets(dict_):
    # turn each key tuple into a set so tag order no longer matters
    return [set(key) for key in dict_]


def match_best_sentence(image_tags):
    # print every key whose tags all appear among the image tags
    for i, tags in enumerate(list_of_keys):
        if tags <= image_tags:  # subset test, same as (tags & image_tags) == tags
            print(list(sentences.keys())[i])


list_of_keys = keys_to_list_of_sets(sentences)
tags = {'beach', 'dogs', 'woman'}
match_best_sentence(tags)

results:

('dogs',)
('dogs', 'beach')
('woman', 'beach')

This solution runs over all keys of the ordered dictionary, O(n); I would like to see any performance improvement.

Upvotes: 1

Views: 73

Answers (2)

Erich

Reputation: 2072

The simplest way of doing this without using a DB seems to be to keep a set of sentences for each word and take intersections.

More explicitly:

If a sentence contains the word "woman", you put it into the "woman" set; similarly for "dog", "beach", etc. This means your space complexity is O(sentences * average_tags), since each sentence is repeated in the data structure once per tag.

You may have:

>>> dogs = set(["I like dogs", "the dog is at the beach"])
>>> woman = set(["The woman sat at the cafe.", "The woman was at the beach"])
>>> beach = set(["the dog is at the beach", "The woman was at the beach", "I do not like the beach"])
>>> dogs.intersection(beach)
{'the dog is at the beach'}

You can build this into an object on top of a defaultdict, so that given a list of tags you intersect only the corresponding sets and return the result.

Rough implementation idea:

from collections import defaultdict

class myObj(object):  # python2-style class, works in python3 too
    def __init__(self):
        # one set of sentences per tag
        self.sets = defaultdict(lambda: set())

    def add_sentence(self, sentence, tags):
        # how you process tags is up to you, they could also be parsed from
        # the input string
        for t in tags:
            self.sets[t].add(sentence)

    def get_match(self, tags):
        # start from the first tag's set and intersect with the rest
        result = self.sets[tags[0]]
        for t in tags[1:]:
            result = result.intersection(self.sets[t])

        return result  # this function can stand to be improved but the idea is there

Maybe this will make it clearer how the defaultdict and the sets end up looking inside the object.

>>> a = defaultdict(lambda: set())
>>> a['woman']
set([])
>>> a['woman'].add(1)
>>> str(a)
"defaultdict(<function <lambda> at 0x7fcb3bbf4b90>, {'woman': set([1])})"
>>> a['beach'].update([1,2,3,4])
>>> a['woman'].intersection(a['beach'])
set([1])
>>> str(a)
"defaultdict(<function <lambda> at 0x7fcb3bbf4b90>, {'woman': set([1]), 'beach': set([1, 2, 3, 4])})"

Upvotes: 1

abc

Reputation: 11939

It mainly depends on the size of the database, on the number of combinations between keywords, and on which operation you perform most often.
If the database is small and you need a fast find operation, one possibility is a dictionary whose keys are frozensets containing the tags and whose values are lists of the associated sentences.

For instance,

from collections import defaultdict

d = defaultdict(list)
# preprocessing
d[frozenset(["bob", "car", "red"])].append("Bob owns a red car")

# searching: frozensets ignore order, so both lookups find the entry
d[frozenset(["bob", "car", "red"])]  # ['Bob owns a red car']
d[frozenset(["red", "car", "bob"])]  # ['Bob owns a red car']

For combinations of words like "bob", "car" you have different possibilities, depending on the number of keywords and on what matters more. For example:

  • you could add an extra entry for each combination of tags
  • you could iterate over the keys and keep the ones that contain both "bob" and "car" (see the sketch below)
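A rough sketch of the second option (partial_match is just an illustrative name):

def partial_match(d, tags):
    # collect the sentences of every key that contains all the query tags
    tags = set(tags)
    hits = []
    for key, sents in d.items():
        if tags <= key:  # every query tag appears in this key
            hits.extend(sents)
    return hits

partial_match(d, ["bob", "car"])  # ['Bob owns a red car']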

Upvotes: 0
