How to assign index numbers to a document dataset using SimhashIndex()?

Question

This code computes Simhash values for a small set of documents and builds an index over them.

import re
from simhash import Simhash, SimhashIndex
def get_features(s):
   width = 3
   s = s.lower()
   s = re.sub(r'[^\w]+', '', s)
   return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]

data = {
1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',
2: u'How are you i am fine. blar blar blar blar blar than',
3: u'This is simhash test.',
 }
objs = [(str(k), Simhash(get_features(v))) for k, v in data.items()]
index = SimhashIndex(objs, k=3)

Now I have used this code to index a large dataset (the training dataset, data_train).

def get_features(s):
   width = 3
   # Inconsistent indentation on the return line raised an IndentationError; fixed here
   return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]

objs = [(str(k), Simhash(get_features(data_train[k]))) for k in range(len(data_train))]
index = SimhashIndex(objs, k=500)

With k=3 this works, but with a value like k=500 it seems to go into a never-ending loop. Please tell me why this is happening and how I can get an index number for every document in data_train.
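For context, as far as I understand it, k is the tolerance: the maximum number of differing bits (Hamming distance) at which two fingerprints still count as near-duplicates. The sketch below is only an illustration under assumptions: it assumes 64-bit fingerprints (which I believe is the package default), and toy_fingerprint is a hypothetical stand-in built on hashlib, not the real Simhash algorithm. It shows that once k reaches the fingerprint width, every pair falls within the tolerance, so a value like 500 cannot discriminate between documents.

```python
import hashlib
from itertools import combinations

F = 64  # assumed fingerprint width in bits


def toy_fingerprint(text):
    """Hypothetical 64-bit fingerprint for illustration; NOT the Simhash algorithm."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # first 8 bytes -> 64 bits


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def count_pairs_within(fingerprints, k):
    """Count pairs whose Hamming distance is at most k."""
    return sum(
        1
        for a, b in combinations(fingerprints, 2)
        if hamming_distance(a, b) <= k
    )


docs = [
    "How are you? I am fine.",
    "This is simhash test.",
    "Something unrelated.",
]
prints = [toy_fingerprint(d) for d in docs]

for k in (3, F, 500):
    print("k=%d: %d of 3 pairs within tolerance" % (k, count_pairs_within(prints, k)))
```

Two 64-bit values can differ in at most 64 positions, so in this sketch k=64 and k=500 behave identically and match every pair.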

