Reputation: 2448
I'm currently learning spaCy, and I have an exercise on word and sentence embeddings. The sentences are stored in a pandas DataFrame column, and we're asked to train a classifier based on the vectors of these sentences.
I have a dataframe that looks like this:
+---+---------------------------------------------------+
| | sentence |
+---+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... |
+---+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... |
+---+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... |
+---+---------------------------------------------------+
Next, I apply an NLP function to these sentences:
import en_core_web_md
nlp = en_core_web_md.load()
df['tokenized'] = df['sentence'].apply(nlp)
Now, if I understand correctly, each item in df['tokenized'] has a vector attribute that returns the vector of the sentence as a 1D array:
print(type(df['tokenized'][0].vector))
print(df['tokenized'][0].vector.shape)
yields
<class 'numpy.ndarray'>
(300,)
How do I add the content of this array (300 values) as columns to the df dataframe for the corresponding sentence, ignoring stop words?
Thanks!
Upvotes: 3
Views: 3903
Reputation: 25189
Assume you have a list of sentences:
sents = ["'Whitey on the Moon' is a 1970 spoken word"
, "St Anselm's Church is a Roman Catholic church"
, "Nymphargus grandisonae (common name: giant)"]
that you put into a dataframe:
import pandas as pd

df = pd.DataFrame({"sentence": sents})
print(df)
sentence
0 'Whitey on the Moon' is a 1970 spoken word
1 St Anselm's Church is a Roman Catholic church
2 Nymphargus grandisonae (common name: giant)
Then you may proceed as follows:
import numpy as np

df['tokenized'] = df['sentence'].apply(nlp)
df['sent_vectors'] = df['tokenized'].apply(
    lambda sent: np.mean([token.vector for token in sent if not token.is_stop], axis=0)
)
The resulting sent_vectors column holds the element-wise mean of the vector embeddings of all tokens that are not stop words (identified via the token.is_stop attribute). Note the axis=0: without it, np.mean collapses the stacked token vectors into a single scalar instead of a 300-dimensional vector.
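As a quick sanity check, each entry of sent_vectors should now have the same 300-dimensional shape as the individual token vectors (a minimal sketch, reusing the df built above):
# Each pooled sentence vector keeps the 300 dimensions of the en_core_web_md token vectors.
print(type(df['sent_vectors'][0]))  # <class 'numpy.ndarray'>
print(df['sent_vectors'][0].shape)  # (300,)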
Note 1: What you call a sentence in your dataframe is actually an instance of spaCy's Doc class.
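You can verify this directly:
print(type(df['tokenized'][0]))  # <class 'spacy.tokens.doc.Doc'>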
Note 2: Though you may prefer to go through a pandas dataframe, the recommended way would be through a getter extension:
import numpy as np
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_md")

sents = ["'Whitey on the Moon' is a 1970 spoken word",
         "St Anselm's Church is a Roman Catholic church",
         "Nymphargus grandisonae (common name: giant)"]

# Average the vectors of all non-stop tokens in a Doc.
vector_except_stopwords = lambda doc: np.mean(
    [token.vector for token in doc if not token.is_stop], axis=0
)
Doc.set_extension("vector_except_stopwords", getter=vector_except_stopwords)

vecs = []  # for demonstration purposes
for doc in nlp.pipe(sents):
    vecs.append(doc._.vector_except_stopwords)
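nlp.pipe processes the texts as a stream, which is generally faster than calling nlp on each sentence separately. If you then want the vectors as a feature matrix, one option (a sketch, assuming the vecs list collected above) is:
# Stack the per-document vectors into an (n_sentences, 300) matrix.
X = np.vstack(vecs)
print(X.shape)  # (3, 300)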
Upvotes: 5
Reputation: 2448
Actually, using a single value averaging all vectors doesn't yield good results in a classification model. What was needed was indeed a dataframe with 300 columns per sentence (since 300 is the standard length of spaCy word embeddings).
So, to continue @Sergey's code:
sents = ["'Whitey on the Moon' is a 1970 spoken word"
, "St Anselm's Church is a Roman Catholic church"
, "Nymphargus grandisonae (common name: giant)"]
df=pd.DataFrame({"sentence":sents})
df['tokenized'] = df['sentence'].apply(nlp)
df['sent_vectors'] = df['tokenized'].apply(lambda x: x.vector)
vectors = 0['sent_vector'].apply(pd.Series)
With this, vectors contains the features on which a model can be trained. For instance, assuming each sentence has a sentiment label attached to it:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X = vectors
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LogisticRegression()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
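From there you can evaluate the classifier on the held-out split, e.g. (a sketch; accuracy is just one possible metric):
from sklearn.metrics import accuracy_score

# Compare predictions on the test split against the true labels.
print(accuracy_score(y_test, y_pred))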
What I couldn't do is remove stop words from the DataFrame entries (i.e. remove each Token object whose is_stop attribute is True from the parent Doc object stored in the dataframe).
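One workaround (a sketch, not a definitive solution) is to rebuild each sentence from its non-stop tokens and parse it again, since spaCy does not offer a direct way to delete tokens from an existing Doc:
# Re-parse each sentence with the stop words stripped out, yielding new Doc objects.
df['no_stop'] = df['tokenized'].apply(
    lambda doc: nlp(' '.join(token.text for token in doc if not token.is_stop))
)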
Upvotes: 1