Reputation: 67
This code computes Simhash fingerprints for a small set of sample data and indexes them.
import re
from simhash import Simhash, SimhashIndex
def get_features(s):
    width = 3
    s = s.lower()
    s = re.sub(r'[^\w]+', '', s)
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]

data = {
    1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',
    2: u'How are you i am fine. blar blar blar blar blar than',
    3: u'This is simhash test.',
}
objs = [(str(k), Simhash(get_features(v))) for k, v in data.items()]
index = SimhashIndex(objs, k=3)
Now I have used this code to index a huge dataset (the training dataset, data_train).
def get_features(s):
    width = 3
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]

objs = [(str(k), Simhash(get_features(data_train[k]))) for k in range(len(data_train))]
index = SimhashIndex(objs, k=500)
If I put 'k=3' it works, but for values like 'k=500' it seems to go into a never-ending loop. Please tell me why this is happening and how I can get the index for all of my 'data_train' data.
Upvotes: 1
Views: 717
Reputation: 967
Without going into your code in detail, k is the maximum Hamming distance you wish to allow between two simhashes for them to count as similar. k can never be larger than the number of bits in your simhash, and typically it won't be larger than 6 or 7 for most real-world corpora. Often it must be as small as 2 or 3.
Increasing k will cause drastic increase in CPU time and/or storage required to detect similarities. You won't see the effects of this until your system is under load, with lots of simhashes in your hash tables.
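To make the cost concrete, here is a rough sketch of the usual pigeonhole-style bucketing: each fingerprint is split into k + 1 blocks and stored under one key per block. It assumes 64-bit fingerprints, and the names block_offsets and block_keys are made up for illustration; this is not the library's actual code.

```python
F = 64  # assumed fingerprint width in bits


def block_offsets(k, f=F):
    """Start bit of each of the k + 1 blocks an f-bit fingerprint is split into."""
    width = f // (k + 1)  # becomes 0 as soon as k + 1 > f
    return [width * i for i in range(k + 1)]


def block_keys(value, k, f=F):
    """One (block_value, block_number) bucket key per block."""
    offsets = block_offsets(k, f)
    keys = []
    for i, offset in enumerate(offsets):
        # The last block absorbs whatever bits are left over.
        end = offsets[i + 1] if i + 1 < len(offsets) else f
        mask = (1 << (end - offset)) - 1
        keys.append(((value >> offset) & mask, i))
    return keys


print(block_offsets(3))                    # four 16-bit blocks
print(len(block_keys(0xDEADBEEF, 500)))    # 501 keys for a single fingerprint
print(set(block_offsets(500)))             # every block starts at bit 0
```

Under these assumptions, k=3 gives four selective 16-bit keys per item. With k=500 the block width is zero, so all but the last key are identical for every fingerprint: each item is stored 501 times and almost every bucket contains the whole dataset. On a large data_train that would look like a loop that never finishes, even though it is only extremely slow.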
To better understand what k is, see this explanation of simhash.
Note also that you will not find similarities between the example texts you've hardcoded. They are very short and hence changing even one word changes too large a proportion of the features. Simhash can only detect similarities when changes are very slight.
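If you want to check this yourself, the following is a minimal, standard-library-only sketch of a simhash and a Hamming distance. It is an assumption-laden stand-in (MD5 as the feature hash, 64 bits, unweighted features), not the simhash package's implementation, so the exact distances will differ from what the package reports.

```python
import hashlib
import re


def get_features(s, width=3):
    # Same character 3-gram features as in the question.
    s = re.sub(r'[^\w]+', '', s.lower())
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]


def simhash(features, f=64):
    # Each feature votes +1 or -1 on every bit position; the sign of the
    # total decides the corresponding bit of the fingerprint.
    counts = [0] * f
    for feat in features:
        h = int.from_bytes(hashlib.md5(feat.encode('utf-8')).digest(), 'big')
        for bit in range(f):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(f) if counts[bit] > 0)


def hamming(a, b):
    # Number of bit positions in which the two fingerprints differ.
    return bin(a ^ b).count('1')


texts = [
    'How are you? I Am fine. blar blar blar blar blar Thanks.',
    'How are you i am fine. blar blar blar blar blar than',
    'This is simhash test.',
]
hashes = [simhash(get_features(t)) for t in texts]
for i in range(len(hashes)):
    for j in range(i + 1, len(hashes)):
        print(i + 1, j + 1, hamming(hashes[i], hashes[j]))
```

Comparing the printed distances with the k you intend to use shows whether a given pair would be reported as similar at all.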
Upvotes: 1