Reputation: 6562
Core question: What is the right way (or ways) of using word embeddings to represent text?
I am building a sentiment classification application for tweets, classifying each tweet as negative, neutral, or positive. I am doing this with Keras on top of Theano, using pretrained word embeddings (Google's word2vec or Stanford's GloVe).
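For reference, this is roughly how I load the pretrained vectors (a minimal sketch; the `glove.6B.100d.txt` path is just an example file name, substitute whichever embedding file you use):

```python
import numpy as np

EMBED_DIM = 100  # must match the dimensionality of the embedding file

def load_glove(path="glove.6B.100d.txt"):
    """Load pretrained GloVe vectors into a word -> np.array dict."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = parts[0]
            embeddings[word] = np.asarray(parts[1:], dtype="float32")
    return embeddings

embeddings = load_glove()
```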
To represent the tweet text, I look up the embedding for each word in the tweet and then combine the word vectors, either by summing them or by concatenating them.
If I follow the above, then my dataset is nothing but vectors (sums or concatenations of word vectors, depending on which strategy I use). I am training a deep net such as an FFN or LSTM on this dataset, but my results aren't coming out great.
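Concretely, this is roughly how I build the dataset for the summing strategy (a sketch; `embeddings` and `EMBED_DIM` come from the loading snippet above, `tweets` is a placeholder for my list of raw tweet strings, and out-of-vocabulary words are simply skipped):

```python
import numpy as np

def tweet_to_vector(tweet, embeddings, dim=EMBED_DIM):
    """Represent a tweet as the sum of its word vectors (OOV words are skipped)."""
    vec = np.zeros(dim, dtype="float32")
    for word in tweet.lower().split():
        if word in embeddings:
            vec += embeddings[word]
    return vec

# X becomes a (num_tweets, EMBED_DIM) matrix that I feed to a feed-forward net
X = np.stack([tweet_to_vector(t, embeddings) for t in tweets])
```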
Is this the right way to use word embeddings to represent text? What other, better ways are there?
Your feedback/critique will be of immense help.
Upvotes: 2
Views: 1734
Reputation: 129
Summing them doesn't make much sense, to be honest: the sum is just another vector, and I don't think it represents the semantics of "Hello World". Even if it does in that case, it certainly won't hold for longer sentences in general.
Instead, it would be better to feed the embeddings in as a sequence, since that at least preserves word order in a meaningful way, which seems to fit your problem better.
For example, "A hates Apple" vs. "Apple hates A": this difference is captured when you feed the words as a sequence into an RNN, but the two sums are identical. I hope you get my point!
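To illustrate, here is a minimal Keras sketch of feeding tweets as padded index sequences into an LSTM (not your exact setup: `texts`, `labels`, `vocab_size`, and `maxlen` are placeholders you would fill in, and `labels` is assumed to be a one-hot array for the three classes):

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = 20000   # placeholder: cap on vocabulary size
maxlen = 30          # placeholder: tweets are short, so a small max length

# texts: list of raw tweet strings; labels: (num_tweets, 3) one-hot array
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(texts)
sequences = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=maxlen)

model = Sequential()
# The Embedding layer could also be initialized with pretrained GloVe weights
model.add(Embedding(vocab_size, 100, input_length=maxlen))
model.add(LSTM(128))
model.add(Dense(3, activation="softmax"))  # negative / neutral / positive
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
model.fit(sequences, labels, epochs=5, batch_size=32)
```

This way the LSTM sees the words in order, so "A hates Apple" and "Apple hates A" produce different hidden states, unlike the summed representation.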
Upvotes: 0
Reputation: 1032
I think that, for your purpose, it is better to think about another way of composing those vectors. The literature on word embeddings contains criticisms of these kinds of composition (I will edit the answer with the correct references as soon as I find them).
I would also suggest considering other possible approaches, for instance:
Upvotes: 1