Will

Reputation: 381

Is there a simple way to tell spaCy to ignore stop words when using the .similarity method?

Right now I have a really simple program that takes a sentence, finds the sentence in a given book that is most semantically similar, and prints out that sentence along with the next few sentences.

import spacy
nlp = spacy.load('en_core_web_lg')

#load alice in wonderland
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
text = strip_headers(load_etext(11)).strip()

alice = nlp(text)

sentences = list(alice.sents)

mysent = nlp("example sentence, could be whatever")

best_match = None
best_similarity_value = 0
for sent in sentences:
    similarity = sent.similarity(mysent)
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_match = sent

print(sentences[sentences.index(best_match):sentences.index(best_match) + 10])

I want to get better results by telling spaCy to ignore the stop words during this process, but I don't know the best way to go about it. For example, I could create a new blank list and append each word that isn't a stop word to it:

for sentence in sentences:
    for word in sentence:
        if not word.is_stop:
            newlist.append(word)

but I would have to make it more complicated than the code above, because I would need to keep the integrity of the original list of sentences (the indexes have to stay the same so I can print out the full sentences later). Plus, if I did it this way, I would have to run this new list of lists back through spaCy in order to use the .similarity method.
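Roughly what I have in mind is something like this (a rough, untested sketch; the filtered_sentences / filtered_mysent names are just for illustration), keeping a parallel list so the indexes still line up with the original sentences:

# rough, untested sketch: build a parallel list of stop-word-free Docs
# so that filtered_sentences[i] still corresponds to sentences[i]
filtered_sentences = []
for sentence in sentences:
    words = [word.text for word in sentence if not word.is_stop]
    # fall back to the original text if a sentence is nothing but stop words
    filtered_sentences.append(nlp(" ".join(words) or sentence.text))

# filter the query sentence the same way
filtered_mysent = nlp(" ".join(w.text for w in mysent if not w.is_stop))

# compare against the filtered versions, but print from the original list
best_index = max(range(len(filtered_sentences)),
                 key=lambda i: filtered_sentences[i].similarity(filtered_mysent))
print(sentences[best_index:best_index + 10])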

I feel like there must be a better way of going about this, and I'd really appreciate any guidance. Even if there isn't a better way than appending each non-stop word to a new list, I'd appreciate any help in creating a list of lists so that the indexes will be identical to the original "sentences" variable.

Thanks so much!

Upvotes: 3

Views: 3321

Answers (2)

David Marx

Reputation: 8558

Here's a slightly more elegant solution: we're going to override how spaCy calculates document vectors under the hood, which will propagate this customization to any downstream pipeline components such as the TextCategorizer.

This is based on the documentation found here: https://spacy.io/usage/processing-pipelines#custom-components-user-hooks

This solution was designed around loading pre-trained embeddings. Instead of referencing a list of stop words directly, I'm just going to assume that anything that's out-of-vocabulary for my loaded embeddings is a token I want to ignore in the document vector calculation.

import numpy as np

class FancyDocumentVectors(object):
    def __call__(self, doc):
        doc.user_hooks["vector"] = self.vector
        return doc

    def vector(self, doc):
        """
        Constrain attention to non-zero vectors.
        Returns concatenation of mean and max pooling
        """
        # This is the part where we filter out stop words 
        # (really any token for which we couldn't calculate a vector representation).
        # If you'd rather filter against a stop word list, change the line below to something like:
        # doc_vecs = np.array([t.vector for t in doc if t.text not in STOPWORDS])
        doc_vecs = np.array([t.vector for t in doc if t.has_vector])
        if sum(doc_vecs.shape) == 0: 
            doc_vecs = np.array([doc[0].vector])

        mean_pooled = doc_vecs.mean(axis=0)
        
        # Because I'm fancy, I'm going to augment my custom document vector with 
        # some additional information. For a demonstration of the value of this 
        # approach, reference the SWEM paper: https://arxiv.org/abs/1805.09843
        max_pooled = doc_vecs.max(axis=0)
        doc_vec = np.hstack([mean_pooled, max_pooled])
        return doc_vec

        # If you're not into it, just return mean_pooled instead.
        # return mean_pooled

nlp.add_pipe(FancyDocumentVectors())

Here's a concrete example using vectors trained on Stack Overflow!

First, we load our pretrained embeddings into an empty language model.

import spacy
from gensim.models.keyedvectors import KeyedVectors

# https://github.com/vefstathiou/SO_word2vec
word_vect = KeyedVectors.load_word2vec_format("SO_vectors_200.bin", binary=True)
nlp = spacy.blank('en')
nlp.vocab.vectors = spacy.vocab.Vectors(data=word_vect.syn0, keys=word_vect.index2word) 

Default behavior before changing anything:

doc = nlp("This is a question about spacy.")
for token in doc:
  print(token, token.vector_norm, token.vector.sum())
print(doc.vector_norm, doc.vector.sum())

# This 0.0 0.0
# is 0.0 0.0
# a 0.0 0.0
# question 25.44337 -41.958717
# about 0.0 0.0
# spacy 13.833485 -6.3489656
# . 0.0 0.0
# 4.353660220883036 -6.901098

Modified behavior after overriding document vector calculation:

# MAGIC!
nlp.add_pipe(FancyDocumentVectors())

doc = nlp("This is a question about spacy.")
for token in doc:
  print(token, token.vector_norm, token.vector.sum())
print(doc.vector_norm, doc.vector.sum())

# This 0.0 0.0
# is 0.0 0.0
# a 0.0 0.0
# question 25.44337 -41.958717
# about 0.0 0.0
# spacy 13.833485 -6.3489656
# . 0.0 0.0
# 24.601780061609414 109.74769
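With the hook registered, Doc.similarity should pick the change up as well, since it works off doc.vector and doc.vector_norm, and (as the output above shows) both now reflect the custom vector. A quick sketch, with an arbitrary second sentence:

doc1 = nlp("This is a question about spacy.")
doc2 = nlp("Vectors and similarity in spacy.")  # arbitrary second document
print(doc1.similarity(doc2))  # cosine similarity of the custom mean+max pooled vectors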

Upvotes: 1

gdaras

Reputation: 10119

What you need to do is override the way spaCy computes similarity.

For similarity computation, spaCy first computes a vector for each doc by averaging the vectors of its tokens (the token.vector attribute) and then computes cosine similarity by doing:

return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

You have to tweak this a bit so that the vectors of stop words are not taken into account.

The following code should work for you:

import spacy
from spacy.lang.en import STOP_WORDS
import numpy as np
nlp = spacy.load('en_core_web_lg')
doc1 = nlp("This is a sentence")
doc2 = nlp("This is a baby")

def compute_similarity(doc1, doc2):
    # average the vectors of the non-stop-word tokens in each doc,
    # then return the cosine similarity of the two averaged vectors
    vector1 = np.zeros(300)
    vector2 = np.zeros(300)
    for token in doc1:
        if (token.text not in STOP_WORDS):
            vector1 = vector1 + token.vector
    vector1 = np.divide(vector1, len(doc1))
    for token in doc2:
        if (token.text not in STOP_WORDS):
            vector2 = vector2 + token.vector
    vector2 = np.divide(vector2, len(doc2))
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

print(compute_similarity(doc1, doc2))
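To plug this into the search loop from the question, something along these lines should work (a sketch assuming the sentences and mysent variables from the original post; a sentence Span iterates over tokens just like a Doc, so it can be passed straight in):

# sketch: reuse compute_similarity in the original search loop
best_match = None
best_similarity_value = 0
for sent in sentences:
    similarity = compute_similarity(sent, mysent)  # sent is a Span, mysent a Doc
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_match = sent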

Hope it helps!

Upvotes: 2
