Reputation: 115
I was checking out Simhash module ( https://github.com/leonsim/simhash ).
I presume that the Simhash("String").distance(Simhash("Another string")) is the hamming distance between the two strings. Now, I am not sure I understand this "get_features(string) method completely, as shown in (https://leons.im/posts/a-python-implementation-of-simhash-algorithm/).
def get_features(s):
width = 2
s = s.lower()
s = re.sub(r'[^\w]+', '', s)
return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]
Now, when I try to compute distance between "aaaa" and "aaas" using the width 2, it gives out the distance as 0.
from simhash import Simhash
Simhash(get_features("aaas")).distance(Simhash(get_features("aaaa")))
I am not sure what am I missing out in here.
Upvotes: 1
Views: 3274
Reputation: 41
Dig into code
The width, in your case, is the key parameter in get_features()
, which give different splitted words. The get_features()
in your case will output like:
['aa', 'aa', 'aa']
['aa', 'aa', 'as']
Then Simhash calculates these list as unweighted features (which means the default weight of each feature is 1) and output like:
86f24ba207a4912
86f24ba207a4912
They are the same!
The reason is from simhash algorithm itself. Let's look into the code:
def build_by_features(self, features):
"""
`features` might be a list of unweighted tokens (a weight of 1
will be assumed), a list of (token, weight) tuples or
a token -> weight dict.
"""
v = [0] * self.f
masks = [1 << i for i in range(self.f)]
if isinstance(features, dict):
features = features.items()
for f in features:
if isinstance(f, basestring):
h = self.hashfunc(f.encode('utf-8'))
w = 1
else:
assert isinstance(f, collections.Iterable)
h = self.hashfunc(f[0].encode('utf-8'))
w = f[1]
for i in range(self.f):
v[i] += w if h & masks[i] else -w
ans = 0
for i in range(self.f):
if v[i] >= 0:
ans |= masks[i]
self.value = ans
The calculation process can be divied into 4 steps: 1) hash each splitted word (feature), to transform string into binary numbers; 2) weight them; 3) assumble weighted bits together; 4) change the assumbled number into binary and output as the value.
Now, in your case, the step 3 will output like:
[-3, 3, -3, -3, 3, -3, -3, -3, 3, -3, -3, 3, -3, -3, 3, -3, -3, 3, -3, 3, 3, 3, 3, -3, -3, -3, -3, -3, -3, 3, -3, -3, -3, 3, -3, 3, 3, 3, -3, 3, -3, -3, 3, -3, -3, 3, -3, -3, 3, 3, 3, 3, -3, 3, 3, -3, -3, -3, -3, 3, -3, -3, -3, -3] [-1, 3, -3, -1, 3, -3, -3, -1, 3, -3, -3, 1, -1, -1, 1, -3, -3, 3, -1, 3, 1, 3, 1, -3, -1, -3, -3, -1, -1, 3, -1, -1, -1, 3, -1, 1, 3, 1, -1, 1, -3, -3, 1, -1, -3, 3, -3, -1, 1, 3, 3, 3, -3, 3, 3, -3, -1, -1, -1, 1, -3, -3, -3, -1]
And after step 4, the 2 output the same value.
Other parameter
If you change the width from 2 to 1, 3, 4, you will get different result of
Simhash(get_features())
.
Your case shows the limitation of simhash with short length text.
Upvotes: 3