Azura Ahmad

Reputation: 11

similarity matching calculation in python

I'm working on a question answering project in Python. I already have the vectors for the question and the answer documents, together with their tf-idf values, but I don't know how to calculate the similarity between them in Python.

Upvotes: 0

Views: 238

Answers (3)

Jonatan

Reputation: 2886

You can use the Levenshtein distance. See http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python for the code, and http://en.wikipedia.org/wiki/Levenshtein_distance for a discussion of the algorithm. Note that it compares two strings (or sequences) rather than tf-idf vectors.

Here is a snippet copied from the above link:

def levenshtein(s1, s2):
    if len(s1) < len(s2):
        return levenshtein(s2, s1)
    if not s1:
        return len(s2)

    previous_row = range(len(s2) + 1)  # xrange in Python 2; range in Python 3
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1 # j+1 instead of j since previous_row and current_row are one character longer
            deletions = current_row[j] + 1       # than s2
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]

Upvotes: 1

Junuxx

Reputation: 14251

You could use the Euclidean distance between the two vectors, or another distance metric (e.g., Hamming distance), or the cross-correlation of the vectors.
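
As a rough illustration of the Euclidean option, here is a minimal sketch. It assumes each vector is stored as a dict mapping a word to its tf-idf weight; the function name `euclidean_distance` and the sample weights are made up for the example. A smaller distance means the two vectors are more alike.

```python
import math


def euclidean_distance(vec_a, vec_b):
    """Euclidean distance between two sparse vectors stored as dicts.

    Assumption: keys are words, values are tf-idf weights; a word missing
    from one dict is treated as having weight 0 in that vector.
    """
    words = set(vec_a) | set(vec_b)
    return math.sqrt(
        sum((vec_a.get(w, 0.0) - vec_b.get(w, 0.0)) ** 2 for w in words)
    )


# Hypothetical tf-idf weights, for illustration only
question = {"python": 0.5, "similarity": 0.8}
answer = {"python": 0.5, "cosine": 0.3}

distance = euclidean_distance(question, answer)
print(distance)
```

Keep in mind that Euclidean distance is sensitive to vector length, so a long document and a short one on the same topic can end up far apart unless the vectors are normalized first.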

Upvotes: 1

user278064

Reputation: 10170

Cosine Similarity

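import math

# question and answer are dicts mapping each word to its tf-idf weight
length_question = .0
length_answer = .0

# Sum of squared weights (squared Euclidean norm) of each vector
for word_tfidf in question.values():
    length_question += word_tfidf**2

for word_tfidf in answer.values():
    length_answer += word_tfidf**2

# Dot product: words missing from the answer contribute 0
similarity = .0
for word in question:
    question_word_tfidf = question[word]
    answer_word_tfidf = answer.get(word, 0)
    similarity += question_word_tfidf * answer_word_tfidf

# Normalize by the product of the two norms (both vectors must be non-empty)
similarity /= math.sqrt(length_question * length_answer)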
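
For a question answering setup you would probably want to score several candidate answers and pick the best one. The sketch below wraps the same calculation in a function and ranks candidates by score. The function name `cosine_similarity`, the document names, and the weights are all hypothetical, and returning 0.0 for an all-zero vector is just one possible convention.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse tf-idf vectors stored as dicts."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: an empty or all-zero vector matches nothing
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical tf-idf weights, for illustration only
question = {"python": 1.0, "similarity": 1.0}
candidates = {
    "doc_a": {"python": 1.0, "similarity": 1.0},
    "doc_b": {"python": 1.0, "java": 1.0},
    "doc_c": {"cooking": 2.0},
}

# Highest-scoring candidate first
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(question, candidates[name]),
    reverse=True,
)
print(ranked)
```

With non-negative tf-idf weights the score falls between 0 (no shared words) and 1 (same direction), so the first entry in `ranked` is the closest match.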

Upvotes: 1
