frazman
frazman

Reputation: 33243

Advice on how to design a datastructure

I have a file from which I am reading the data. I need advice on how to design the data structure which does the following: So, the data is of form

id_1::id_2::similiarity_score

Now, though the data is in this form but it also means that

id_2::id_1::same_similiarity_Score

So, what I want is a datastructure which when I use in program. So lets say I want to use this data in order to find which two items are similar

object.maxSimiliarity(object_id_1)
returns object_id_2 # has max score

but then this object_id_1 can also be in product_id_2 column in the database...

so in database in can be either of form:

 object_id_1:: object_id_2::score
 or object_id2::object_id_1::score

so I sort off want to design this datastructure in a way that

k_1, k_2:: value <--> k_2,k_1::value

Upvotes: 2

Views: 155

Answers (3)

Rob Cowie
Rob Cowie

Reputation: 22619

The data look very much like nodes and edges of a weighted graph. If a is similar to b with a score 5.0, and similar to c with a score 1.0, you might visualise it thus:

                                    a
                                   / \
                                  /   \
                                5.0   1.0
                                /       \
                               b         c

Networkx is a python lib that provides ready-made graph objects and algorithms. Loading up your data into a weighted multigraph (that is, it supports multiple connections between nodes A--B and B--A is trivial. After that, getting the most similar object given an object id is a case of finding the node, finding it's most weighted edge and returning the node at the end of it.

import networkx as nx

## Test data
data = """\
a::b::2
b::a::3
a::c::5
b::e::1
"""
rows = (row.split('::') for row in data.split())


class Similarity(object):
    def __init__(self, data):
        self.g = nx.MultiGraph()
        self.load(data)

    def load(self, data):
        ## Turn the row into data suitable for networkx graph
        rows = ((row[0], row[1], float(row[2])) for row in data)
        self.g.add_weighted_edges_from(rows)

    def most_similar(self, obj_id):
        ## Get edges from obj_id node
        edges = self.g.edges_iter(obj_id, data=True)
        ## Sort by weight, get first, get joined node
        return sorted([(i[0], i[1], i[2].get('weight', 0)) for i in edges])[-1][1]


sc = Similarity(rows)
sc.most_similar('a') ## 'c'
## Add some more data linking a --> f with a high score
sc.load([('a', 'f', 10)])
sc.most_similar('a') ## 'f'

Upvotes: 0

mcdowella
mcdowella

Reputation: 19601

A general trick for this sort of thing is to find a canonicalisation - a function that maps all members of a particular class to the same object. In this case, you might achieve it by sorting the first two components, which will transform B::A::Score to A::B::Score, while leaving A::B::Score as it is.

Upvotes: 3

Raymond Hettinger
Raymond Hettinger

Reputation: 226316

It seems to me that you could use the scores to build lists of best to worst matches:

d = {
     'id1':  [id_best_match_to_id1, id_next_best_match_to_id1, ..., id_worst_match_to_id1],
     'id2':  [id_best_match_to_id2, id_next_best_match_to_id2, ..., id_worst_match_to_id2],
     ...
}

If the similarity scores need to be retained, use a list of tuples in the form (id_best_match_to_id1, similarity_score_to_id1).

I don't see a way to exploit that similarity is a symmetric relation where sim(x,y)==sim(y,x).

Upvotes: 2

Related Questions