Reputation: 103
I'm working with the sentiment140 dataset to try to learn sentiment analysis using RNNs. I found this tutorial online that uses the keras.imdb datasource, but I want to try to use my own datasource, so I have tried to adapt the code to my own data.
Tutorial: https://towardsdatascience.com/a-beginners-guide-on-sentiment-analysis-with-rnn-9e100627c02e
The data preprocessing involves extracting series data and then tokenizing and padding it before sending it to the model for training. I performed these operations in my code below, but whenever I try to run the training I get:
if isinstance(data[0], list):
IndexError: list index out of range
I did not define data anywhere, so this leads me to believe that I did something that Keras or TensorFlow did not like. Any ideas as to what is causing this error?
My data is currently in a CSV file with the headers SENTIMENT and TEXT. SENTIMENT is 0 for negative and 1 for positive. TEXT is the processed tweet that was collected. Here is a sample.
Dataset CSV (only a few lines to save space)
SENTIMENT,TEXT
0,about to file tax
0,ahh i hate dogs
1,My paycheck came in today
1,lot to do before chi this weekend
1,lol love food
Code
import pandas as pd
import keras
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import json
import numpy as np
# Load in DS
df = pd.read_csv('./train.csv')
print(df.head())
#Create sequence
vocabulary_size = 1000
tokenizer = Tokenizer(num_words= vocabulary_size, split=' ')
tokenizer.fit_on_texts(df['TEXT'].values)
X_train = tokenizer.texts_to_sequences(df['TEXT'].values)
#Pad Sequence
X_train = pad_sequences(X_train)
print(X_train)
#Get Sentiment
y_train = df['SENTIMENT'].tolist()
#create model
max_words = 24
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
embedding_size=32
model=Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
batch_size = 64
num_epochs = 3
X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]
model.fit(X_train2, y_train2,
          validation_data=(X_valid, y_valid),
          batch_size=batch_size,
          epochs=num_epochs)
Output
Using TensorFlow backend.
SENTIMENT TEXT
0 0 aww that be bummer You shoulda get david carr ...
1 0 be upset that he can not update his facebook b...
2 0 I dive many time for the ball manage to save t...
3 0 my whole body feel itchy and like its on fire
4 0 no it be not behave at all be mad why be here ...
[[ 0 0 0 ... 3 10 5]
[ 0 0 0 ... 46 47 89]
[ 0 0 0 ... 29 9 96]
...
[ 0 0 0 ... 30 309 310]
[ 0 0 0 ... 0 0 72]
[ 0 0 0 ... 33 312 313]]
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 24, 32)            32000
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101
=================================================================
Total params: 85,301
Trainable params: 85,301
Non-trainable params: 0
_________________________________________________________________
None
Traceback (most recent call last):
  File "mcve.py", line 50, in <module>
    epochs=num_epochs)
  File "/home/dv/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training.py", line 950, in fit
    batch_size=batch_size)
  File "/home/dv/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training.py", line 787, in _standardize_user_data
    exception_prefix='target')
  File "/home/dv/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training_utils.py", line 79, in standardize_input_data
    if isinstance(data[0], list):
IndexError: list index out of range
JUPYTER NOTEBOOK ERROR
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-25-184505b70981> in <module>()
20 model.fit(X_train2, y_train2,
21 batch_size=batch_size,
---> 22 epochs=num_epochs)
23
~/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
948 sample_weight=sample_weight,
949 class_weight=class_weight,
--> 950 batch_size=batch_size)
951 # Prepare validation data.
952 do_validation = False
~/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training.py in _standardize_user_data(self, x, y, sample_weight, class_weight, check_array_lengths, batch_size)
785 feed_output_shapes,
786 check_batch_axis=False, # Don't enforce the batch size.
--> 787 exception_prefix='target')
788
789 # Generate sample-wise weight values given the `sample_weight` and
~/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
77 'for each key in: ' + str(names))
78 elif isinstance(data, list):
---> 79 if isinstance(data[0], list):
80 data = [np.asarray(d) for d in data]
81 elif len(names) == 1 and isinstance(data[0], (float, int)):
IndexError: list index out of range
Upvotes: 1
Views: 2336
Reputation: 3790
Edit
My former suggestion is wrong. I've checked your code and run it, and it works without errors for me.
Then I looked at the source code of the standardize_input_data function. There's a check on the data argument:
def standardize_input_data(data,
                           names,
                           shapes=None,
                           check_batch_axis=True,
                           exception_prefix=''):
    """Normalizes inputs and targets provided by users.
    Users may pass data as a list of arrays, dictionary of arrays,
    or as a single array. We normalize this to an ordered list of
    arrays (same order as `names`), while checking that the provided
    arrays have shapes that match the network's expectations.
    # Arguments
        data: User-provided input data (polymorphic).
    ...
At line 79:
elif isinstance(data, list):
    if isinstance(data[0], list):
        ...
So it looks like, in the error case, the input data is a list, but a list of zero length.
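For illustration, here is a minimal sketch (not the actual Keras call path, just that check in isolation) of how an empty list trips the branch:
data = []                            # an empty list reaching the check
if isinstance(data, list):
    if isinstance(data[0], list):    # raises IndexError: list index out of range
        pass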
The standardize_input_data function is called inside the Model.fit(...) method through a call to Model._standardize_user_data(...). Through this chain of functions, the data argument gets the value of the x argument of Model.fit(x, y, ...). So my guess is that the problem is with the type or content of X_train2 or X_valid. Could you provide the contents of X_train2 and X_valid in addition to X_train?
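In the meantime, a quick way to see what actually reaches fit (a debugging snippet I made up, using only names from your code; note that your traceback shows exception_prefix='target', so the targets may be worth inspecting as well):
import numpy as np
print(type(X_train2), np.shape(X_train2))  # expected: numpy array of shape (num_samples, 24)
print(type(y_train2), np.shape(y_train2))  # y_train came from df['SENTIMENT'].tolist(), so this is a plain list
print(type(X_valid), np.shape(X_valid))
print(type(y_valid), np.shape(y_valid))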
Old wrong suggestion
You should increase the vocabulary size by one to deal with out-of-vocabulary tokens, I guess. I.e., change the initialization of the Embedding layer:
model.add(Embedding(vocabulary_size + 1, embedding_size, input_length=max_words))
According to the docs, "input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1".
You may want to check the maximum value in X_train (edited).
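If you want to verify that, here is a quick check (assuming X_train is the padded numpy array produced by pad_sequences above):
import numpy as np
print(np.max(X_train))                     # largest token index fed to the Embedding layer
print(np.max(X_train) < vocabulary_size)   # should be True when input_dim=vocabulary_size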
Hope it helps!
Upvotes: 3