Will

Reputation: 381

Is there a simple way to tell spaCy to ignore stop words when using the .similarity method?

Right now I have a really simple program that takes a sentence, finds the sentence in a given book that is most semantically similar, and prints out that sentence along with the next few sentences.

import spacy
nlp = spacy.load('en_core_web_lg')

#load alice in wonderland
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
text = strip_headers(load_etext(11)).strip()

alice = nlp(text)

sentences = list(alice.sents)

mysent = nlp("example sentence, could be whatever")

best_match = None
best_similarity_value = 0
for sent in sentences:
    similarity = sent.similarity(mysent)
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_match = sent

print(sentences[sentences.index(best_match):sentences.index(best_match) + 10])

I want to get better results by telling spaCy to ignore the stop words during this process, but I don't know the best way to go about it. For example, I could create a new blank list and append each word that isn't a stop word to it:

for sentence in sentences:
    for word in sentence:
        if not word.is_stop:
            newlist.append(word)

but I would have to make it more complicated than the code above, because I would need to keep the integrity of the original list of sentences (the indexes have to stay the same so I can print out the full sentences later). Plus, if I did it this way, I would have to run this new list of lists back through spaCy in order to use the .similarity method.
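Roughly what I have in mind is something like this (a rough, untested sketch; the filtered_sentences / filtered_mysent names are just for illustration), keeping a parallel list so the indexes still line up with the original sentences:

# rough, untested sketch: build a parallel list of stop-word-free Docs
# so that filtered_sentences[i] still corresponds to sentences[i]
filtered_sentences = []
for sentence in sentences:
    words = [word.text for word in sentence if not word.is_stop]
    # fall back to the original text if a sentence is nothing but stop words
    filtered_sentences.append(nlp(" ".join(words) or sentence.text))

# filter the query sentence the same way
filtered_mysent = nlp(" ".join(w.text for w in mysent if not w.is_stop))

# compare against the filtered versions, but print from the original list
best_index = max(range(len(filtered_sentences)),
                 key=lambda i: filtered_sentences[i].similarity(filtered_mysent))
print(sentences[best_index:best_index + 10])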

I feel like there must be a better way of going about this, and I'd really appreciate any guidance. Even if there isn't a better way than appending each non-stop word to a new list, I'd appreciate any help in creating a list of lists so that the indexes will be identical to the original "sentences" variable.

Thanks so much!

Upvotes: 3

Views: 3321

Answers (2)

David Marx

Reputation: 8558

Here's a slightly more elegant solution: we're going to override how spaCy calculates document vectors under the hood, which will propagate this customization to any downstream pipeline components such as the TextCategorizer.

This is based on the documentation found here: https://spacy.io/usage/processing-pipelines#custom-components-user-hooks

This solution was designed around loading pre-trained embeddings. Instead of referencing a list of stop words directly, I'm just going to assume that anything that's out-of-vocabulary for my loaded embeddings is a token I want to ignore in the document vector calculation.

import numpy as np

class FancyDocumentVectors(object):
    def __call__(self, doc):
        doc.user_hooks["vector"] = self.vector
        return doc

    def vector(self, doc):
        """
        Constrain attention to non-zero vectors.
        Returns concatenation of mean and max pooling
        """
        # This is the part where we filter out stop words 
        # (really any token for which we couldn't calculate a vector representation).
        # If you'd rather filter against a stop word list, change the line below to something like:
        # doc_vecs = np.array([t.vector for t in doc if t.text not in STOPWORDS])
        doc_vecs = np.array([t.vector for t in doc if t.has_vector])
        if sum(doc_vecs.shape) == 0: 
            doc_vecs = np.array([doc[0].vector])

        mean_pooled = doc_vecs.mean(axis=0)
        
        # Because I'm fancy, I'm going to augment my custom document vector with 
        # some additional information. For a demonstration of the value of this 
        # approach, reference the SWEM paper: https://arxiv.org/abs/1805.09843
        max_pooled = doc_vecs.max(axis=0)
        doc_vec = np.hstack([mean_pooled, max_pooled])
        return doc_vec

        # If you're not into it, just return mean_pooled instead.
        # return mean_pooled

nlp.add_pipe(FancyDocumentVectors())

Here's a concrete example using vectors trained on Stack Overflow!

First, we load our pretrained embeddings into an empty language model.

import spacy
from gensim.models.keyedvectors import KeyedVectors

# https://github.com/vefstathiou/SO_word2vec
word_vect = KeyedVectors.load_word2vec_format("SO_vectors_200.bin", binary=True)
nlp = spacy.blank('en')
nlp.vocab.vectors = spacy.vocab.Vectors(data=word_vect.syn0, keys=word_vect.index2word) 

Default behavior before changing anything:

doc = nlp("This is a question about spacy.")
for token in doc:
  print(token, token.vector_norm, token.vector.sum())
print(doc.vector_norm, doc.vector.sum())

# This 0.0 0.0
# is 0.0 0.0
# a 0.0 0.0
# question 25.44337 -41.958717
# about 0.0 0.0
# spacy 13.833485 -6.3489656
# . 0.0 0.0
# 4.353660220883036 -6.901098

Modified behavior after overriding document vector calculation:

# MAGIC!
nlp.add_pipe(FancyDocumentVectors())

doc = nlp("This is a question about spacy.")
for token in doc:
  print(token, token.vector_norm, token.vector.sum())
print(doc.vector_norm, doc.vector.sum())

# This 0.0 0.0
# is 0.0 0.0
# a 0.0 0.0
# question 25.44337 -41.958717
# about 0.0 0.0
# spacy 13.833485 -6.3489656
# . 0.0 0.0
# 24.601780061609414 109.74769
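With the hook registered, Doc.similarity should pick the change up as well, since it works off doc.vector and doc.vector_norm, and (as the output above shows) both now reflect the custom vector. A quick sketch, with an arbitrary second sentence:

doc1 = nlp("This is a question about spacy.")
doc2 = nlp("Vectors and similarity in spacy.")  # arbitrary second document
print(doc1.similarity(doc2))  # cosine similarity of the custom mean+max pooled vectors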

Upvotes: 1

gdaras

Reputation: 10119

What you need to do is override the way spaCy computes similarity.

For similarity computation, spaCy first computes a vector for each doc by averaging the vectors of its tokens (the token.vector attribute) and then computes cosine similarity by doing:

return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

You have to tweak this a bit so that the vectors of stop words are not taken into account.

The following code should work for you:

import spacy
from spacy.lang.en import STOP_WORDS
import numpy as np
nlp = spacy.load('en_core_web_lg')
doc1 = nlp("This is a sentence")
doc2 = nlp("This is a baby")

def compute_similarity(doc1, doc2):
    # average the vectors of the non-stop-word tokens in each doc,
    # then return the cosine similarity of the two averaged vectors
    vector1 = np.zeros(300)
    vector2 = np.zeros(300)
    for token in doc1:
        if (token.text not in STOP_WORDS):
            vector1 = vector1 + token.vector
    vector1 = np.divide(vector1, len(doc1))
    for token in doc2:
        if (token.text not in STOP_WORDS):
            vector2 = vector2 + token.vector
    vector2 = np.divide(vector2, len(doc2))
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

print(compute_similarity(doc1, doc2))
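To plug this into the search loop from the question, something along these lines should work (a sketch assuming the sentences and mysent variables from the original post; a sentence Span iterates over tokens just like a Doc, so it can be passed straight in):

# sketch: reuse compute_similarity in the original search loop
best_match = None
best_similarity_value = 0
for sent in sentences:
    similarity = compute_similarity(sent, mysent)  # sent is a Span, mysent a Doc
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_match = sent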

Hope it helps!

Upvotes: 2
