Ishan Sharma
Ishan Sharma

Reputation: 115

Hamming distance (Simhash python) giving out unexpected value

I was checking out Simhash module ( https://github.com/leonsim/simhash ).

I presume that the Simhash("String").distance(Simhash("Another string")) is the hamming distance between the two strings. Now, I am not sure I understand this "get_features(string) method completely, as shown in (https://leons.im/posts/a-python-implementation-of-simhash-algorithm/).

def get_features(s):
    width = 2
    s = s.lower()
    s = re.sub(r'[^\w]+', '', s)
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]

Now, when I try to compute distance between "aaaa" and "aaas" using the width 2, it gives out the distance as 0.

from simhash import Simhash

Simhash(get_features("aaas")).distance(Simhash(get_features("aaaa")))

I am not sure what am I missing out in here.

Upvotes: 1

Views: 3274

Answers (1)

Matthew Bao
Matthew Bao

Reputation: 41

Dig into code

The width, in your case, is the key parameter in get_features(), which give different splitted words. The get_features() in your case will output like:

['aa', 'aa', 'aa']

['aa', 'aa', 'as']

Then Simhash calculates these list as unweighted features (which means the default weight of each feature is 1) and output like:

86f24ba207a4912

86f24ba207a4912

They are the same!

The reason is from simhash algorithm itself. Let's look into the code:

def build_by_features(self, features):
    """
    `features` might be a list of unweighted tokens (a weight of 1
        will be assumed), a list of (token, weight) tuples or
        a token -> weight dict.
    """
    v = [0] * self.f
    masks = [1 << i for i in range(self.f)]
    if isinstance(features, dict):
        features = features.items()
    for f in features:
        if isinstance(f, basestring):
            h = self.hashfunc(f.encode('utf-8'))
            w = 1
        else:
            assert isinstance(f, collections.Iterable)
            h = self.hashfunc(f[0].encode('utf-8'))
            w = f[1]
        for i in range(self.f):
            v[i] += w if h & masks[i] else -w
    ans = 0
    for i in range(self.f):
        if v[i] >= 0:
            ans |= masks[i]
    self.value = ans

from: leonsim/simhash

The calculation process can be divied into 4 steps: 1) hash each splitted word (feature), to transform string into binary numbers; 2) weight them; 3) assumble weighted bits together; 4) change the assumbled number into binary and output as the value.

Now, in your case, the step 3 will output like:

[-3, 3, -3, -3, 3, -3, -3, -3, 3, -3, -3, 3, -3, -3, 3, -3, -3, 3, -3, 3, 3, 3, 3, -3, -3, -3, -3, -3, -3, 3, -3, -3, -3, 3, -3, 3, 3, 3, -3, 3, -3, -3, 3, -3, -3, 3, -3, -3, 3, 3, 3, 3, -3, 3, 3, -3, -3, -3, -3, 3, -3, -3, -3, -3]

[-1, 3, -3, -1, 3, -3, -3, -1, 3, -3, -3, 1, -1, -1, 1, -3, -3, 3, -1, 3, 1, 3, 1, -3, -1, -3, -3, -1, -1, 3, -1, -1, -1, 3, -1, 1, 3, 1, -1, 1, -3, -3, 1, -1, -3, 3, -3, -1, 1, 3, 3, 3, -3, 3, 3, -3, -1, -1, -1, 1, -3, -3, -3, -1]

And after step 4, the 2 output the same value.

Other parameter

If you change the width from 2 to 1, 3, 4, you will get different result of Simhash(get_features()).

Your case shows the limitation of simhash with short length text.

Upvotes: 3

Related Questions