Reputation: 840
I am fairly new to deep learning and I have been trying to build a simple sentiment analyzer for natural language processing using the Reuters dataset. Here is my code:
import numpy as np
from keras.datasets import reuters
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, Dropout, GRU
from keras.utils import np_utils
max_length=3000
vocab_size=100000
epochs=10
batch_size=32
validation_split=0.2
(x_train, y_train), (x_test, y_test) = reuters.load_data(path="reuters.npz",
                                                         num_words=vocab_size,
                                                         skip_top=5,
                                                         maxlen=None,
                                                         test_split=0.2,
                                                         seed=113,
                                                         start_char=1,
                                                         oov_char=2,
                                                         index_from=3)
tokenizer = Tokenizer(num_words=max_length)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
y_train = np_utils.to_categorical(y_train, 50)
y_test = np_utils.to_categorical(y_test, 50)
model = Sequential()
model.add(GRU(50, input_shape = (49,1), return_sequences = True))
model.add(Dropout(0.2))
model.add(Dense(256, input_shape=(max_length,), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(50, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()
history = model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, validation_split=validation_split)
score = model.evaluate(x_test, y_test)
print('Test Accuracy:', round(score[1]*100,2))
What I do not understand is why, every time I try to use a GRU or LSTM cell instead of a Dense one, I get this error:
ValueError: Error when checking input: expected gru_1_input to have 3 dimensions, but got array with shape (8982, 3000)
I have seen online that adding return_sequences = True
could solve the issue, but as you can see, the error remains in my case.
What should I do?
Upvotes: 1
Views: 1950
Reputation: 33460
The problem is that the shape of x_train
is (8982, 3000), which means (given the preprocessing stage) there are 8982 sentences, each encoded as a binary vector over a vocabulary of 3000 words. A GRU (or LSTM) layer, on the other hand, expects a sequence as input, so its input shape should be (batch_size, num_timesteps or sequence_length, feature_size)
. Currently, the only feature you have is the presence (1) or absence (0) of a particular word in a sentence. So to make this work with a GRU, you need to add a third dimension to x_train
and x_test
:
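# add a trailing feature dimension: (8982, 3000) -> (8982, 3000, 1)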
x_train = np.expand_dims(x_train, axis=-1)
x_test = np.expand_dims(x_test, axis=-1)
and then remove return_sequences=True
and change the input shape of the GRU to input_shape=(3000, 1)
. This way you are telling the GRU layer that it is processing sequences of length 3000 where each element consists of a single feature. (As a side note, I think you should pass vocab_size
to the num_words
argument of the Tokenizer
, since that indicates the number of words in the vocabulary. Instead, pass max_length
to the maxlen
argument of load_data
, which limits the length of a sentence.)
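Putting those changes together, a minimal sketch of the adjusted model (keeping your original binary-matrix preprocessing, and assuming x_train and x_test have already been expanded to (num_samples, 3000, 1) as shown above) might look like this:
model = Sequential()
model.add(GRU(50, input_shape=(3000, 1)))   # no return_sequences: output shape is (batch_size, 50)
model.add(Dropout(0.2))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(50, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
Be aware that a GRU running over 3000 timesteps will train slowly; this sketch only illustrates the shape fix.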
However, I think you may get better results if you use an Embedding layer as the first layer, before the GRU layer. That's because the way you currently encode the sentences does not take the order of words into account (it only records their presence). Feeding that representation to GRU or LSTM layers, which rely on the order of elements in a sequence, therefore does not make much sense.
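For instance, here is a rough sketch of that idea (the sequence length of 500, the embedding size of 128, and the x_train_raw / x_test_raw names are illustrative choices of mine, not part of your code; x_train_raw and x_test_raw stand for the raw word-index sequences as returned by reuters.load_data, i.e. before sequences_to_matrix):
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding

seq_len = 500   # arbitrary cap on sentence length
x_train_seq = pad_sequences(x_train_raw, maxlen=seq_len)   # pad/truncate each sequence to seq_len
x_test_seq = pad_sequences(x_test_raw, maxlen=seq_len)

emb_model = Sequential()
emb_model.add(Embedding(input_dim=vocab_size, output_dim=128, input_length=seq_len))
emb_model.add(GRU(50))
emb_model.add(Dropout(0.2))
emb_model.add(Dense(50, activation='softmax'))
emb_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
This way the GRU sees the words in order, and the Embedding layer learns a dense representation for each word index instead of a single 0/1 feature.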
Upvotes: 2