Mon

Reputation: 61

Text semantic similarity by analogy in the hypernym level using Python

I have a few long (50-line) paragraphs whose similarity I would like to measure using Python. I am mainly interested in the semantic similarity of these texts at the hypernym level (a term from linguistics), with a focus on functions and processes. To clarify: I would call two pieces of text similar if they both refer to the same function or process, regardless of the words used in them.

Here are two examples: Similar_Sentences = ("use a tube to suck soda in","transfer blood to the heart using a pump and artery"). Unsimilar_Sentences = ("use a tube to suck soda in","do some programming to get better").

In the first example, "tube" ~ "artery", "soda" ~ "blood", and "suck in" ~ "transfer to". I hope it is clear what I am interested in.

Based on my research on NLP algorithms and tools, NLTK and WordNet in Python seem to be the right tools for this task, but I am not sure how to use them for it.

Any pointers to relevant tutorials or learning resources, as well as any suggestions, are appreciated in advance.

Upvotes: 1

Views: 770

Answers (1)

David Dale

Reputation: 11434

There is a great post on NLPForHackers describing how to implement sentence similarity using WordNet.

Their ingredients:

  1. A POS tagger to narrow down the list of synsets for each word (after this, the authors just take the first synset)
  2. path_similarity: a metric showing how far two words are from each other on the taxonomy graph
  3. Aggregation of word-level similarities: max, then average.

This already works pretty well: for your positive example, the similarity score is 0.29, and for the negative example, the score is only 0.20.
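
For reference, here is a minimal sketch of that baseline (my own reconstruction, not the NLPForHackers code verbatim, and shown in one direction only): POS-tag the words, keep just the first synset of each word, take the max path_similarity per word, then average. It reuses penn_to_wn from the full listing further down, so run that part first.

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn

def first_synsets(sentence):
    """ POS-tag the sentence and keep only the first synset of each word """
    synsets = []
    for word, tag in pos_tag(word_tokenize(sentence)):
        wn_tag = penn_to_wn(tag)  # defined in the full listing below
        if wn_tag is None:
            continue
        candidates = wn.synsets(word, wn_tag)
        if candidates:
            synsets.append(candidates[0])
    return synsets

def baseline_similarity(sentence1, sentence2):
    """ For each word of sentence1, max path_similarity to sentence2, then average """
    synsets1 = first_synsets(sentence1)
    synsets2 = first_synsets(sentence2)
    best = [max((s1.path_similarity(s2) or 0 for s2 in synsets2), default=0) for s1 in synsets1]
    return sum(best) / len(best) if best else 0.0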

I would suggest a few improvements:

  1. Use not only the first synset found, but take the max over all synsets of a word. This makes the scores for your positive and negative examples 0.36 and 0.23 respectively - farther apart than before.
  2. Use Word Mover's Distance to aggregate word similarities, instead of max-and-mean. I convert between similarity and distance with the formula s = 1 - d^2/2 (see the quick check below). This pushes the scores for your positive and negative samples even further apart - to 0.41 and 0.19 respectively.
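
A quick sanity check of that conversion (my own illustration, not part of the original recipe): s = 1 - d^2/2 and d = sqrt(2*(1 - s)) are inverses of each other, and this is exactly the pair used in the code below.

import numpy as np

# similarity -> distance and back: s = 1 - d**2/2  <=>  d = sqrt(2*(1 - s))
for s in [0.0, 0.29, 0.5, 1.0]:
    d = np.sqrt(2 * (1 - s))
    assert abs((1 - d**2 / 2) - s) < 1e-12
    print(f"similarity {s:.2f} <-> distance {d:.3f}")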

Here is the code for my final version:

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn
import numpy as np
from pyemd import emd

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

def penn_to_wn(tag):
    """ Convert between a Penn Treebank tag to a simplified Wordnet tag """
    if tag.startswith('N'):
        return 'n'
    if tag.startswith('V'):
        return 'v'
    if tag.startswith('J'):
        return 'a'
    if tag.startswith('R'):
        return 'r'
    return None

def tagged_to_synsets(word, tag):
    wn_tag = penn_to_wn(tag)
    if wn_tag is None:
        return []
    return wn.synsets(word, wn_tag)

def get_counts(sentence, vocab):
    # Bag-of-words weights of a tagged sentence over the shared vocab,
    # normalized to sum to 1 so they can serve as histograms for EMD.
    weights = np.zeros(len(vocab))
    for w in sentence:
        if w not in vocab:
            continue
        weights[vocab.index(w)] += 1
    return weights / sum(weights)

def sim3(sentence1, sentence2):
    # POS-tag both sentences and build a shared vocabulary of (word, tag) pairs
    # that map to a WordNet POS.
    sentence1 = pos_tag(word_tokenize(sentence1))
    sentence2 = pos_tag(word_tokenize(sentence2))
    vocab = [pair for pair in sorted(set(sentence1).union(set(sentence2))) if penn_to_wn(pair[1])]

    # Normalized word-count histograms of the two sentences over the shared vocab.
    w1 = get_counts(sentence1, vocab)
    w2 = get_counts(sentence2, vocab)

    # All WordNet synsets for each vocabulary entry.
    synsets = [tagged_to_synsets(*tagged_word) for tagged_word in vocab]

    # Pairwise word similarity: max path_similarity over all synset pairs.
    similarities = np.array([[
        max([s1.path_similarity(s2) or 0 for s1 in ss1 for s2 in ss2], default=0)
        for ss2 in synsets] for ss1 in synsets]
    )
    # Convert similarities to distances, compute the word mover distance
    # between the two histograms, and convert the result back to a similarity.
    distances = np.sqrt(2*(1-similarities))
    distance = emd(w1, w2, distances)
    similarity = 1 - distance**2 / 2
    return similarity

print(sim3("use a tube to suck soda in","transfer blood to the heart using a pump and artery"))
print(sim3("use a tube to suck soda in","do some programming to get better"))
# 0.41046117311104957
# 0.19280421873943732

We can try to evaluate this similarity method on a dataset - e.g. on Quora Question Pairs

import pandas as pd
from tqdm.auto import tqdm, trange
import matplotlib.pyplot as plt

df = pd.read_csv('http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv', sep='\t')
# score a random sample of 1000 question pairs
sample = df.sample(1000, random_state=1)
sims = pd.Series([sim3(sample.iloc[i].question1, sample.iloc[i].question2) for i in trange(sample.shape[0])], index=sample.index)

# produce a plot
sims[sample.is_duplicate==0].hist(density=True);
sims[sample.is_duplicate==1].hist(alpha=0.5, density=True);
plt.legend(['non-duplicates', 'duplicates'])
plt.title('distribution of wordnet-sentence-similarity\n on quora question pairs');

You can see from the image that scores for duplicate pairs are on average much higher than for non-duplicates, but the overlap is still huge.

(image: overlapping histograms of similarity scores for duplicate and non-duplicate question pairs)

If you want a quantitative metric, you can evaluate e.g. ROC AUC. On this dataset, it is 70%, which is far from perfect but makes a decent baseline.

from sklearn.metrics import roc_auc_score
print(roc_auc_score(sample.is_duplicate, sims))
# 0.7075210210273749

Upvotes: 1
