Reputation: 33243
I have a file from which I am reading data, and I need advice on how to design a data structure for it. The data is of the form
id_1::id_2::similarity_score
Although each row is stored in this form, it also implies
id_2::id_1::same_similarity_score
What I want is a data structure I can then use in a program. For example, say I want to use this data to find which item is most similar to a given one:
object.maxSimilarity(object_id_1)
returns object_id_2 # has max score
But object_id_1 can also appear in the second id column of the database, so a row can take either of these forms:
object_id_1::object_id_2::score
or object_id_2::object_id_1::score
So I sort of want to design this data structure in such a way that
k_1,k_2::value <--> k_2,k_1::value
Upvotes: 2
Views: 155
Reputation: 22619
The data look very much like nodes and edges of a weighted graph. If a is similar to b with a score of 5.0, and similar to c with a score of 1.0, you might visualise it thus:
     a
    / \
   /   \
 5.0   1.0
 /       \
b         c
Networkx is a Python lib that provides ready-made graph objects and algorithms. Loading your data into a weighted multigraph (that is, one that supports multiple connections between nodes, such as A--B and B--A) is trivial. After that, getting the most similar object given an object id is a case of finding the node, finding its most heavily weighted edge and returning the node at the other end of it.
import networkx as nx
## Test data
data = """\
a::b::2
b::a::3
a::c::5
b::e::1
"""
rows = (row.split('::') for row in data.split())
class Similarity(object):
    def __init__(self, data):
        self.g = nx.MultiGraph()
        self.load(data)

    def load(self, data):
        ## Turn each row into (node, node, weight) suitable for the networkx graph
        rows = ((row[0], row[1], float(row[2])) for row in data)
        self.g.add_weighted_edges_from(rows)

    def most_similar(self, obj_id):
        ## Get edges incident to the obj_id node (edges_iter was removed in networkx 2.x)
        edges = self.g.edges(obj_id, data=True)
        ## Pick the edge with the largest weight and return the node at its other end
        return max(edges, key=lambda e: e[2].get('weight', 0))[1]

sc = Similarity(rows)
sc.most_similar('a') ## 'c'

## Add some more data linking a --> f with a high score
sc.load([('a', 'f', 10)])
sc.most_similar('a') ## 'f'
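Because the multigraph edges are undirected, the lookup works no matter which column an id appeared in; for example, with the test data above (this is just an illustrative extra call, not part of the original example):
sc.most_similar('e') ## 'b' -- even though 'e' only ever appears in the second column of the input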
Upvotes: 0
Reputation: 19601
A general trick for this sort of thing is to find a canonicalisation - a function that maps all members of a particular class to the same object. In this case, you might achieve it by sorting the first two components, which will transform B::A::Score to A::B::Score, while leaving A::B::Score as it is.
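For instance, a minimal sketch of that trick in Python, assuming the '::' rows from the question (canonical_key and SimilarityTable are illustrative names, not anything from a library):
def canonical_key(id_a, id_b):
    ## Map (id_a, id_b) and (id_b, id_a) to the same dictionary key by sorting
    return tuple(sorted((id_a, id_b)))

class SimilarityTable(object):
    def __init__(self):
        self.scores = {}       ## canonical (id, id) pair -> score
        self.neighbours = {}   ## id -> set of ids it has a score against

    def add(self, id_a, id_b, score):
        self.scores[canonical_key(id_a, id_b)] = score
        self.neighbours.setdefault(id_a, set()).add(id_b)
        self.neighbours.setdefault(id_b, set()).add(id_a)

    def score(self, id_a, id_b):
        return self.scores[canonical_key(id_a, id_b)]

    def max_similarity(self, obj_id):
        ## The neighbour whose canonical pair with obj_id has the highest score
        return max(self.neighbours[obj_id],
                   key=lambda other: self.score(obj_id, other))

table = SimilarityTable()
for line in ["a::b::2", "a::c::5", "b::e::1"]:
    id_a, id_b, score = line.split('::')
    table.add(id_a, id_b, float(score))

table.score('b', 'a')      ## 2.0, same as table.score('a', 'b')
table.max_similarity('a')  ## 'c'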
Upvotes: 3
Reputation: 226316
It seems to me that you could use the scores to build lists of best to worst matches:
d = {
    'id1': [id_best_match_to_id1, id_next_best_match_to_id1, ..., id_worst_match_to_id1],
    'id2': [id_best_match_to_id2, id_next_best_match_to_id2, ..., id_worst_match_to_id2],
    ...
}
If the similarity scores need to be retained, use a list of tuples of the form (id_best_match_to_id1, similarity_score_to_id1).
I don't see a way to exploit the fact that similarity is a symmetric relation where sim(x,y) == sim(y,x).
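As a hedged sketch of one way to build such a structure from the '::' rows (the names below are illustrative), each pair is simply inserted under both ids so a lookup works from either side:
from collections import defaultdict

matches = defaultdict(list)   ## id -> [(other_id, score), ...] sorted best to worst

rows = ["a::b::2", "a::c::5", "b::e::1"]
for row in rows:
    id_a, id_b, score = row.split('::')
    matches[id_a].append((id_b, float(score)))
    matches[id_b].append((id_a, float(score)))

for id_, pairs in matches.items():
    pairs.sort(key=lambda p: p[1], reverse=True)

matches['a'][0]   ## ('c', 5.0) -- best match for 'a'
matches['b']      ## [('a', 2.0), ('e', 1.0)]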
Upvotes: 2