Reputation: 113
I'm trying to understand how to prepare paragraphs for ELMo vectorization.
The docs only show how to embed multiple sentences/words at a time, e.g.:
sentences = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog", ""]]
elmo(
    inputs={
        "tokens": sentences,
        "sequence_len": [6, 5]
    },
    signature="tokens",
    as_dict=True
)["elmo"]
As I understand, this will return 2 vectors each representing a given sentence. How would I go about preparing input data to vectorize a whole paragraph containing multiple sentences? Note that I would like to use my own preprocessing.
Can this be done like so?
sentences = [["<s>", "the", "cat", "is", "on", "the", "mat", ".", "</s>",
              "<s>", "dogs", "are", "in", "the", "fog", ".", "</s>"]]
or maybe like so?
sentences = [["the", "cat", "is", "on", "the", "mat", ".",
"dogs", "are", "in", "the", "fog", "."]]
Upvotes: 3
Views: 1508
Reputation: 308
ELMo produces contextual word vectors, so the vector for a word is a function of both the word and the context (e.g., the sentence) it appears in.
As in your example from the docs, you want your paragraph to be a list of sentences, each of which is a list of tokens. So your second example. To get this format, you could use the spaCy tokenizer:
import spacy
# you need to install the language model first. See spacy docs.
nlp = spacy.load('en_core_web_sm')
text = "The cat is on the mat. Dogs are in the fog."
toks = nlp(text)
sentences = [[w.text for w in s] for s in toks.sents]
I don't think you need the extra padding "" on the second sentence, as sequence_len takes care of this.
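For comparison, the docs example quoted in the question does pad the shorter sentence with "" so that both rows have the same length, and passes the true lengths in sequence_len. If the module turns out to need rectangular input like that, a minimal sketch of the padding step could look like this (pad_batch is a hypothetical helper name, not part of TensorFlow Hub):

```python
def pad_batch(sentences, pad_token=""):
    """Pad tokenized sentences to a common length.

    Returns (padded, lengths), where lengths holds the unpadded
    token count of each sentence.
    """
    lengths = [len(s) for s in sentences]
    max_len = max(lengths, default=0)
    padded = [list(s) + [pad_token] * (max_len - len(s)) for s in sentences]
    return padded, lengths


tokens, sequence_len = pad_batch([
    ["the", "cat", "is", "on", "the", "mat"],
    ["dogs", "are", "in", "the", "fog"],
])
```

Here tokens and sequence_len are meant to be passed as the "tokens" and "sequence_len" inputs of the call from the question.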
Update:
"As I understand, this will return 2 vectors each representing a given sentence"
No, this will return a vector for each word in each sentence. If you want the whole paragraph to be the context (for each word), just change it to
sentences = [["the", "cat", "is", "on", "the", "mat", "dogs", "are", "in", "the", "fog"]]
and
...
"sequence_len": [11]
Upvotes: 1