tensa11

Reputation: 113

Paragraph embedding with ELMo

I'm trying to understand how to prepare paragraphs for ELMo vectorization.

The docs only show how to embed multiple sentences/words at a time.

For example:

sentences = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog", ""]]
elmo(
    inputs={
        "tokens": sentences,
        "sequence_len": [6, 5]
    },
    signature="tokens",
    as_dict=True
)["elmo"]

As I understand it, this will return 2 vectors, each representing a given sentence. How would I go about preparing input data to vectorize a whole paragraph containing multiple sentences? Note that I would like to use my own preprocessing.

Can this be done like so?

sentences = [["<s>", "the", "cat", "is", "on", "the", "mat", ".", "</s>",
              "<s>", "dogs", "are", "in", "the", "fog", ".", "</s>"]]

or maybe like so?

sentences = [["the", "cat", "is", "on", "the", "mat", ".", 
              "dogs", "are", "in", "the", "fog", "."]]

Upvotes: 3

Views: 1508

Answers (1)

al0

Reputation: 308

ELMo produces contextual word vectors, so the vector for a word is a function of both the word itself and the context (e.g., the sentence) it appears in.

Like your example from the docs, you want your paragraph to be a list of sentences, each of which is a list of tokens. So your second example. To get this format, you could use the spaCy tokenizer:

import spacy

# You need to install the language model first, e.g. with
# `python -m spacy download en_core_web_sm`. See the spaCy docs.
nlp = spacy.load('en_core_web_sm')

text = "The cat is on the mat. Dogs are in the fog."
toks = nlp(text)
sentences = [[w.text for w in s] for s in toks.sents]

I don't think you need the extra padding "" on the second sentence as sequence_len takes care of this.
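
If it turns out that the batch does have to be rectangular before it is passed in (an assumption on my part, not something the docs example confirms), one option is to pad the shorter sentences with "" and record the true lengths for sequence_len. A minimal sketch in plain Python; pad_batch is a hypothetical helper name, not part of any library:

```python
def pad_batch(sentences, pad_token=""):
    """Pad token lists to a common length; return (padded, true_lengths)."""
    # True (unpadded) length of every sentence, in input order.
    lengths = [len(s) for s in sentences]
    max_len = max(lengths) if lengths else 0
    # Append pad_token to each sentence until it reaches max_len.
    padded = [s + [pad_token] * (max_len - len(s)) for s in sentences]
    return padded, lengths


tokens, sequence_len = pad_batch([
    ["the", "cat", "is", "on", "the", "mat"],
    ["dogs", "are", "in", "the", "fog"],
])
```

Here tokens and sequence_len would go into the "tokens" and "sequence_len" entries of the inputs dict from the question.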

Update:

"As I understand, this will return 2 vectors each representing a given sentence"

No, this will return a vector for each word in each sentence. If you want the whole paragraph to be the context for each word, just change it to:

sentences = [["the", "cat", "is", "on", "the", "mat", "dogs", "are", "in", "the", "fog"]]

and

...
"sequence_len": [11]

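If your own preprocessing already gives you a list of tokenized sentences, the single-sequence input above can be built by flattening it. A rough sketch in plain Python; paragraph_as_one_sequence is a hypothetical helper name, and whether to keep punctuation tokens is up to your preprocessing:

```python
def paragraph_as_one_sequence(sentences):
    """Flatten tokenized sentences into a batch holding one long sequence."""
    # Concatenate the tokens of all sentences, preserving order.
    tokens = [tok for sent in sentences for tok in sent]
    # A batch of size 1, plus the matching sequence length.
    return [tokens], [len(tokens)]


paragraph, sequence_len = paragraph_as_one_sequence([
    ["the", "cat", "is", "on", "the", "mat"],
    ["dogs", "are", "in", "the", "fog"],
])
```

paragraph and sequence_len can then be used as the "tokens" and "sequence_len" inputs, so every word sees the whole paragraph as its context.
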
Upvotes: 1
