Reputation: 381
So right now I have a really simple program that takes a sentence, finds the sentence in a given book that is most semantically similar, and prints out that sentence along with the next few sentences.
import spacy
nlp = spacy.load('en_core_web_lg')

# load Alice in Wonderland
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
text = strip_headers(load_etext(11)).strip()

alice = nlp(text)
sentences = list(alice.sents)

mysent = nlp(unicode("example sentence, could be whatever"))

best_match = None
best_similarity_value = 0
for sent in sentences:
    similarity = sent.similarity(mysent)
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_match = sent

print sentences[sentences.index(best_match):sentences.index(best_match)+10]
I want to get better results by telling spaCy to ignore the stop words when doing this process, but I don't know the best way to go about it. For example, I could create a new blank list and append each word that isn't a stop word to the list:
newlist = []
for sentence in sentences:
    for word in sentence:
        if not word.is_stop:
            newlist.append(word)
but I would have to make it more complicated than the code above, because I would need to keep the integrity of the original list of sentences (the indexes would have to stay the same if I wanted to print out the full sentences later). Plus, if I did it this way, I would have to run this new list of lists back through spaCy in order to use the .similarity method.
I feel like there must be a better way of going about this, and I'd really appreciate any guidance. Even if there isn't a better way than appending each non-stop word to a new list, I'd appreciate any help in creating a list of lists so that the indexes will be identical to the original "sentences" variable.
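Here's roughly what I mean, just as a sketch building on the variables from my code above: a parallel list where filtered_sentences[i] lines up with sentences[i], with the filtered text run back through spaCy so .similarity still works.

# rough sketch: filtered_sentences[i] corresponds to sentences[i],
# so I can still print the original sentences at the end
filtered_sentences = []
for sent in sentences:
    words = [word.text for word in sent if not word.is_stop]
    # run the filtered text back through spaCy so .similarity is available
    # (sentences made up entirely of stop words end up as empty docs)
    filtered_sentences.append(nlp(u" ".join(words)))

best_index = 0
best_similarity_value = 0
for i, filtered in enumerate(filtered_sentences):
    similarity = filtered.similarity(mysent)
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_index = i

# the index lines up with the original sentences, so printing still works
print sentences[best_index:best_index+10]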
Thanks so much!
Upvotes: 3
Views: 3321
Reputation: 8558
Here's a slightly more elegant solution: we're going to override how spaCy calculates document vectors under the hood, which will propagate this customization to any downstream pipeline components like the TextCategorizer or whatever.
This is based on the documentation found here: https://spacy.io/usage/processing-pipelines#custom-components-user-hooks
This solution was designed around loading pre-trained embeddings. In lieu of referencing a list of stopwords directly, I'm just going to assume that anything that's out-of-vocab for my loaded embeddings is a token I want to ignore in my document vector calculation.
import numpy as np

class FancyDocumentVectors(object):
    def __call__(self, doc):
        doc.user_hooks["vector"] = self.vector
        return doc

    def vector(self, doc):
        """
        Constrain attention to non-zero vectors.
        Returns concatenation of mean and max pooling.
        """
        # This is the part where we filter out stop words
        # (really any token for which we couldn't calculate a vector representation).
        # If you'd rather invoke a stopword list, change the line below to something like:
        # doc_vecs = np.array([t.vector for t in doc if t.text not in STOPWORDS])
        doc_vecs = np.array([t.vector for t in doc if t.has_vector])
        if sum(doc_vecs.shape) == 0:
            doc_vecs = np.array([doc[0].vector])
        mean_pooled = doc_vecs.mean(axis=0)

        # Because I'm fancy, I'm going to augment my custom document vector with
        # some additional information. For a demonstration of the value of this
        # approach, reference the SWEM paper: https://arxiv.org/abs/1805.09843
        # If you're not into it, just return mean_pooled instead.
        max_pooled = doc_vecs.max(axis=0)
        doc_vec = np.hstack([mean_pooled, max_pooled])
        return doc_vec

nlp.add_pipe(FancyDocumentVectors())
Here's a concrete example using vectors trained on Stack Overflow!
First, we load our pretrained embeddings into an empty language model.
import spacy
from gensim.models.keyedvectors import KeyedVectors
# https://github.com/vefstathiou/SO_word2vec
word_vect = KeyedVectors.load_word2vec_format("SO_vectors_200.bin", binary=True)
nlp = spacy.blank('en')
nlp.vocab.vectors = spacy.vocab.Vectors(data=word_vect.syn0, keys=word_vect.index2word)
Default behavior before changing anything:
doc = nlp("This is a question about spacy.")
for token in doc:
    print(token, token.vector_norm, token.vector.sum())
print(doc.vector_norm, doc.vector.sum())
# This 0.0 0.0
# is 0.0 0.0
# a 0.0 0.0
# question 25.44337 -41.958717
# about 0.0 0.0
# spacy 13.833485 -6.3489656
# . 0.0 0.0
# 4.353660220883036 -6.901098
Modified behavior after overriding document vector calculation:
# MAGIC!
nlp.add_pipe(FancyDocumentVectors())

doc = nlp("This is a question about spacy.")
for token in doc:
    print(token, token.vector_norm, token.vector.sum())
print(doc.vector_norm, doc.vector.sum())
# This 0.0 0.0
# is 0.0 0.0
# a 0.0 0.0
# question 25.44337 -41.958717
# about 0.0 0.0
# spacy 13.833485 -6.3489656
# . 0.0 0.0
# 24.601780061609414 109.74769
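To tie this back to the original Alice-in-Wonderland search, here's an untested sketch that assumes the text variable from the question. Two caveats: user_hooks["vector"] only overrides Doc vectors (not Span vectors), so each sentence gets re-run through the pipeline before comparing; and since en_core_web_lg does have vectors for most stop words, you'd want to swap in the stopword-list filter from the comment above (e.g. if not t.is_stop) rather than relying on t.has_vector.

import spacy

nlp = spacy.load('en_core_web_lg')
nlp.add_pipe(FancyDocumentVectors())  # the class defined above

alice = nlp(text)
sentences = list(alice.sents)
mysent = nlp("example sentence, could be whatever")

best_index = 0
best_similarity_value = 0
for i, sent in enumerate(sentences):
    sent_doc = nlp(sent.text)  # a Doc, so the custom vector hook applies
    similarity = sent_doc.similarity(mysent)
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_index = i

print(sentences[best_index:best_index + 10])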
Upvotes: 1
Reputation: 10119
What you need to do is override the way spaCy computes similarity.
For similarity computation, spaCy first computes a vector for each doc by averaging the vectors of each token (the token.vector attribute) and then performs cosine similarity by doing:
return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
You have to tweak this a bit and not take into account vectors of stop words.
The following code should work for you:
import spacy
from spacy.lang.en import STOP_WORDS
import numpy as np

nlp = spacy.load('en_core_web_lg')
doc1 = nlp("This is a sentence")
doc2 = nlp("This is a baby")

def compute_similarity(doc1, doc2):
    vector1 = np.zeros(300)
    vector2 = np.zeros(300)
    for token in doc1:
        if token.text not in STOP_WORDS:
            vector1 = vector1 + token.vector
    vector1 = np.divide(vector1, len(doc1))
    for token in doc2:
        if token.text not in STOP_WORDS:
            vector2 = vector2 + token.vector
    vector2 = np.divide(vector2, len(doc2))
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

print(compute_similarity(doc1, doc2))
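To plug this into the sentence search from the question (untested, reusing the sentences and mysent variables defined there): compute_similarity only iterates over tokens, so it also accepts the Span objects that come out of alice.sents.

best_match = None
best_similarity_value = 0
for sent in sentences:
    # sentences made up entirely of stop words give a zero vector (nan similarity)
    similarity = compute_similarity(sent, mysent)
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_match = sent

print(sentences[sentences.index(best_match):sentences.index(best_match) + 10])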
Hope it helps!
Upvotes: 2