TIMEX

Reputation: 272124

What's the best way to find the similarity among these vectors?

v1 = [33, 24, 55, 56]
v2 = [32, 25, 51, 40]
v3 = [ ... ]
v4 = [ ... ]

Normally, to find which vector is the most similar to v1, I would run v1 against the other vectors with a cosine similarity algorithm.
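For reference, the flat-vector comparison can be sketched as below. This is only a minimal sketch: `cosine_sim` is a hypothetical helper name, and only v1 and v2 are compared because v3 and v4 are elided above.

```python
import math

def cosine_sim(v, w):
    """Cosine similarity of two equal-length numeric lists."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

v1 = [33, 24, 55, 56]
v2 = [32, 25, 51, 40]
score = cosine_sim(v1, v2)  # closer to 1.0 means more similar
```

Running this against every other vector and keeping the highest score gives the closest match to v1.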

Now, I have a more complex set of vectors with the structure:

v1 = [ { 'a': 4, 'b':9, 'c': 12 ... },
       { 'a': 3, 'g': 3, 'b': 33 ... },
       { 'b': 1, 'k': 6, 'n': 19 ... },
       ...
     ]
v2 = [ {}, {}, {} ... ]
v3 = [ {}, {}, {} ... ]
v4 = [ {}, {}, {} ... ]

Given this structure, how would you calculate similarity? (A good match would be a vector that shares many keys with v1 and whose values for those keys are close to v1's values.)

btilly's answer:

def cosine_sim_complex(v, w):
    '''
    Cosine similarity for vectors that are lists of dicts.
    '''
    def complicated_dot(v, w):
        dot = 0
        for (v_i, w_i) in zip(v, w):
            # v_i and w_i are the dicts at the same position in v and w
            for x in v_i:
                if x in w_i:
                    dot += v_i[x] * w_i[x]
        return float(dot)
    length_v = complicated_dot(v, v) ** 0.5
    length_w = complicated_dot(w, w) ** 0.5
    score = complicated_dot(v, w) / (length_v * length_w)
    return score


v1 = [ {'a':44, 'b':21 }, { 'a': 55, 'c': 22 } ]
v2 = [ {'a':99, 'b':21 }, { 'a': 55, 'c': 22 } ]
cosine_sim_complex(v1, v2)
1.01342687531

Upvotes: 1

Views: 841

Answers (2)

btilly

Reputation: 46455

You do the same thing in more dimensions.

Previously you just had 4 dimensions. Now you have a much larger set of dimensions, with 2-dimensional labeling of the indices (position in the list, then key in the dict). But the math remains the same. The dot product looks like this (untested code):

def complicated_dot(v, w):
    dot = 0
    for (v_i, w_i) in zip(v, w):
        for x in v_i:
            if x in w_i:
                dot += v_i[x] * w_i[x]
    return dot

And then you can apply the cosine similarity algorithm that you already know.
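A minimal sketch of that last step, assuming neither vector is empty or all zeros. The dot product is repeated so the block is self-contained, and `cosine_sim_nested` is a hypothetical name, not something from the question.

```python
import math

def complicated_dot(v, w):
    """Dot product over the (position, key) pairs present in both v and w."""
    dot = 0
    for v_i, w_i in zip(v, w):
        for key in v_i:
            if key in w_i:
                dot += v_i[key] * w_i[key]
    return dot

def cosine_sim_nested(v, w):
    """Cosine similarity of two lists of dicts (assumes non-zero norms)."""
    norm_product = math.sqrt(complicated_dot(v, v) * complicated_dot(w, w))
    return complicated_dot(v, w) / norm_product

v1 = [{'a': 44, 'b': 21}, {'a': 55, 'c': 22}]
v2 = [{'a': 99, 'b': 21}, {'a': 55, 'c': 22}]
score = cosine_sim_nested(v1, v2)
```

A key missing from one dict is treated as a zero in that dimension, so by the Cauchy-Schwarz inequality the score always falls in [-1, 1]. For the two example vectors here it comes out at roughly 0.92.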

Upvotes: 2

jerboa

Reputation: 1421

You can use sets and the symmetric difference operator (^) on each item. I assume all the vectors contain the same number of dicts.

diffs = []
vs = (v2, v3, v4)
for vcmp in vs:
    diff = 0
    for v_item_index in range(len(vcmp)):
        # Count the keys that appear in only one of the two dicts
        diff += len(set(vcmp[v_item_index]) ^ set(v1[v_item_index]))
    diffs.append(diff)

print(diffs)

The index of the lowest value in diffs is the index (within vs) of the vector most similar to v1.
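Picking the best match can be sketched as follows. The data is hypothetical, since the real v2 to v4 are elided in the question, and `key_diff` is an illustrative helper name.

```python
def key_diff(v, w):
    """Count keys present in exactly one of the two dicts at each position."""
    return sum(len(set(a) ^ set(b)) for a, b in zip(v, w))

# Hypothetical example data, made up for illustration only.
v1 = [{'a': 4, 'b': 9}, {'a': 3, 'g': 3}]
candidates = {
    'v2': [{'a': 1, 'b': 2}, {'a': 5, 'k': 1}],
    'v3': [{'x': 1, 'y': 2}, {'z': 5}],
}

diffs = {name: key_diff(v1, vec) for name, vec in candidates.items()}
best = min(diffs, key=diffs.get)  # name of the vector with the fewest differing keys
print(diffs, best)
```

Note that this compares keys only. Two vectors with identical keys but very different values still get a difference of 0.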

Upvotes: 0
