HABLOH

Reputation: 470

Python: LSTM model and word embedding

My problem is mainly theoretical. I would like to use an LSTM model to classify the sentiment of sentences in this way: 1 = positive, 0 = neutral and -1 = negative. I have a bag of words (BOW) that I would like to use to train the model. The BOW is a dataframe with two columns like this:

Text          | Sentiment
hello dear... |  1
I hate you... | -1
...           | ...

According to the example proposed by Keras, I should transform the sentences in the 'Text' column of my BOW into numerical vectors, where each number represents a word of the vocabulary.

Now my question is: how do I turn my sentences into vectors of numbers, and what are the best techniques for doing it?

For now my code is this. What am I doing wrong?

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from sklearn.model_selection import train_test_split

model = Sequential()
model.add(LSTM(units=50))
model.add(Dense(2, activation='softmax')) # 2 because I have 3 classes
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['Sentiment'], test_size=0.3, random_state=1) # capital-S 'Sentiment' for the other dataframe

clf = model.fit(X_train, y_train)
predicted = clf.predict(X_test)
print(predicted)

Upvotes: 0

Views: 539

Answers (2)

hola

Reputation: 612

You should first create an index of your vocabulary, i.e. assign an index to each token in your vocabulary, and then transform each text to a numeric form by replacing each token with its corresponding index.
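A minimal sketch of that indexing step, assuming simple whitespace tokenization (df comes from the question; the vocab, sequences and sent_len names are illustrative):

# Build the vocabulary index; index 0 is reserved for the padding value.
vocab = {}
for text in df['Text']:
    for token in text.lower().split():
        if token not in vocab:
            vocab[token] = len(vocab) + 1

# Replace each token in each text by its index.
sequences = [[vocab[token] for token in text.lower().split()]
             for text in df['Text']]

# Common length to pad the sentences to (see below).
sent_len = max(len(seq) for seq in sequences)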

Your model should then be:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(len(vocab) + 1, 64, input_length=sent_len)) # +1 because index 0 is reserved for padding
model.add(LSTM(units=50))
model.add(Dense(3, activation='softmax'))

Note that you need to pad your sentences to a common length before feeding them to the network. You can use np.pad to do so.
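To make the shapes concrete, here is a minimal sketch of the padding step with np.pad, plus one-hot encoding of the three labels so they match the softmax output (sequences and sent_len come from the indexing sketch above; the label mapping and training hyperparameters are illustrative assumptions):

import numpy as np
from tensorflow.keras.utils import to_categorical

# Right-pad every sequence with zeros (index 0 is reserved for padding).
X = np.array([np.pad(seq, (0, sent_len - len(seq)), mode='constant')
              for seq in sequences])

# Map the -1/0/1 labels to class indices 0/1/2, then one-hot them
# to match the Dense(3, activation='softmax') output.
y = to_categorical(df['Sentiment'].map({-1: 0, 0: 1, 1: 2}), num_classes=3)

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.3)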

Another alternative is to use pre-trained word embeddings, which you can download from fastText.
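If you go that route, a rough sketch of wiring the pre-trained vectors into the Embedding layer might look like this (the cc.en.300.vec filename and the 300 dimensions refer to the standard fastText downloads; vocab and sent_len are from the sketches above):

import numpy as np

EMBEDDING_DIM = 300  # dimensionality of the official fastText vectors

# The .vec format is plain text: a header line, then "word v1 v2 ..." per line.
embeddings_index = {}
with open('cc.en.300.vec', encoding='utf-8') as f:
    next(f)  # skip the "num_words dim" header
    for line in f:
        values = line.rstrip().split(' ')
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Weight matrix in vocabulary-index order; row 0 stays all-zero for padding.
embedding_matrix = np.zeros((len(vocab) + 1, EMBEDDING_DIM))
for word, idx in vocab.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector

# Swap this in for the Embedding layer above; trainable=False keeps the
# pre-trained vectors frozen during training.
Embedding(len(vocab) + 1, EMBEDDING_DIM, input_length=sent_len,
          weights=[embedding_matrix], trainable=False)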

P.S. You are misusing the term BOW; that said, BOW is a good baseline model that you can use for sentiment analysis.

Upvotes: 1

Quinn Lanners

Reputation: 46

First of all, as Marat commented, you are not using the term Bag of Words (BOW) correctly here. What you are calling your BOW is simply a labeled dataset of sentences. While there are a lot of questions here, I will try to answer the first one: how to convert your sentences into vectors that can be used in an LSTM model.

The most basic way to do this is to create one-hot-encoding vectors for each word in each sentence. To create these, you first need to iterate through your dataset and assign a unique index to each word. So for example:

vocab = {
    'hello': 0,
    'dear': 1,
    ...
    'hate': 999
}

Once you have this dictionary created, you can go through each sentence and assign each word a vector of length len(vocab), with zeros at every index except the one corresponding to that word. For example, using the vocab above, dear would look like: [0,1,0,0,0,...,0,0].
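As a rough sketch of that step (the vocab dictionary is the one above; the whitespace-split tokenization is an assumption):

import numpy as np

def one_hot(word, vocab):
    # All zeros except a single 1 at the word's index.
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1
    return vec

# e.g. vectorize a whole sentence word by word
vectors = [one_hot(word, vocab) for word in 'hello dear'.split()]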

The pros of one-hot-encoding vectors are that they are easy to create and pretty simple to work with. The downside is that you can quickly end up working with very high-dimensional vectors if you have a large vocabulary. That's where word embeddings come into play, and honestly they are the superior route compared to one-hot-encoding vectors. However, they are a bit more complex, and it is harder to understand exactly what they are doing behind the scenes. You can read more about them here if you want: https://towardsdatascience.com/what-the-heck-is-word-embedding-b30f67f01c81

Upvotes: 1
