Reputation: 11
I'm working on a question answering project in Python. I already have the tf-idf vectors for the question and for the answer documents, but I don't know how to calculate the similarity between them in Python.
Upvotes: 0
Views: 238
Reputation: 2886
You can use the Levenshtein distance, which measures the edit distance between two strings. See http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python for the code and http://en.wikipedia.org/wiki/Levenshtein_distance for a discussion of the algorithm.
Here is a snippet copied from the above link:
def levenshtein(s1, s2):
    # make sure s1 is the longer string
    if len(s1) < len(s2):
        return levenshtein(s2, s1)
    if not s1:
        return len(s2)
    # range works on both Python 2 and 3 (the original snippet used xrange)
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1  # j+1 instead of j since previous_row and current_row are one character longer
            deletions = current_row[j] + 1        # than s2
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]
Upvotes: 1
Reputation: 14251
You could use the Euclidean distance between the two vectors, or another distance metric (e.g., Hamming distance), or the cross-correlation of the vectors.
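As a rough illustration of the Euclidean option, here is a minimal sketch. It assumes each tf-idf vector is stored as a dict mapping a word to its weight, and that a word missing from a dict has weight 0. The function name `euclidean_distance` and the sample weights are made up for the example.

```python
import math

def euclidean_distance(vec_a, vec_b):
    """Euclidean distance between two sparse vectors stored as {word: weight} dicts."""
    # Assumption: a word that is missing from a dict has weight 0.
    words = set(vec_a) | set(vec_b)
    return math.sqrt(sum((vec_a.get(w, 0.0) - vec_b.get(w, 0.0)) ** 2 for w in words))

# Hypothetical tf-idf weights, for illustration only.
question = {"python": 3.0, "similarity": 4.0}
answer = {"python": 3.0}
print(euclidean_distance(question, answer))  # 4.0
```

Note that a smaller distance means the vectors are more alike, so the best answer is the one with the lowest value.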
Upvotes: 1
Reputation: 10170
Cosine Similarity
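import math

# question and answer are dicts mapping each word to its tf-idf weight
length_question = 0.0
length_answer = 0.0
for word_tfidf in question.values():
    length_question += word_tfidf ** 2
for word_tfidf in answer.values():
    length_answer += word_tfidf ** 2

# dot product: words that do not appear in the answer contribute 0
similarity = 0.0
for word in question:
    question_word_tfidf = question[word]
    answer_word_tfidf = answer.get(word, 0)
    similarity += question_word_tfidf * answer_word_tfidf
# divide by the product of the two vector lengths (both must be non-zero)
similarity /= math.sqrt(length_question * length_answer)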
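For question answering you would probably want to score every candidate answer and pick the best one. The following is only a sketch of that idea: it wraps the same cosine calculation in a function and sorts a few candidate documents. The names `cosine_similarity`, `answers`, `doc_a` and so on, and all the weights, are hypothetical, and returning 0.0 for an all-zero vector is a choice made here to avoid dividing by zero.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse {word: tf-idf} dicts; 0.0 if either is all zeros."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w ** 2 for w in vec_a.values()))
    norm_b = math.sqrt(sum(w ** 2 for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an empty or all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)

# Hypothetical tf-idf vectors, for illustration only.
question = {"cosine": 1.0, "similarity": 1.0}
answers = {
    "doc_a": {"cosine": 2.0, "similarity": 2.0},
    "doc_b": {"levenshtein": 1.0, "distance": 1.0},
    "doc_c": {"cosine": 1.0, "distance": 1.0},
}

# Highest similarity first.
ranked = sorted(answers, key=lambda name: cosine_similarity(question, answers[name]), reverse=True)
print(ranked)  # ['doc_a', 'doc_c', 'doc_b']
```

Unlike a distance, a higher cosine similarity means a closer match, so the first entry of `ranked` is the best candidate.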
Upvotes: 1