faysal

Reputation: 162

Getting an error while classifying text data using word2vec

I want to use my own word dataset to create the embeddings, and my own labelled data to train and test the model. I have already created the word embeddings with word2vec, but I am running into a problem when training the model on the labelled data.

I am getting an error while trying to train the model. My model creation code:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# create the tokenizer for the training texts
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
encoded_docs = tokenizer.texts_to_sequences(X_train)

max_length = max([len(s.split()) for s in X_train])
X_train = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

# create a second tokenizer for the test texts
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_test)
encoded_docs = tokenizer.texts_to_sequences(X_test)

X_test = pad_sequences(encoded_docs, maxlen=max_length, padding='post')


from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

# set up the embedding layer with the pre-trained word2vec weights
embeddings = Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1],
                       weights=[embedding_matrix], input_length=max_length, trainable=False)

new_model = Sequential()
new_model.add(embeddings)
new_model.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
new_model.add(MaxPooling1D(pool_size=2))
new_model.add(Flatten())
new_model.add(Dense(1, activation='sigmoid'))

And this is how I created the embedding matrix:

embedding_matrix = np.zeros((len(model.wv.vocab), vector_dim))
for i in range(len(model.wv.vocab)):
    embedding_vector = model.wv[model.wv.index2word[i]]
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

When I run this, I get the following error:

 WARNING:tensorflow:From /Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:1290: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
Epoch 1/10
Traceback (most recent call last):
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[27,2] = 1049 is not in [0, 1045)
     [[Node: embedding_1/GatherV2 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1/embeddings/read, embedding_1/Cast, embedding_1/GatherV2/axis)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/faysal/Desktop/My Computer/D/Code Workspace/Research-IoT/embedding-tut/src/main.py", line 359, in <module>
    custom_keras_model(embedding_matrix, model.wv)
  File "/Users/faysal/Desktop/My Computer/D/Code Workspace/Research-IoT/Collaboration/embedding-tut/src/main.py", line 295, in custom_keras_model
    new_model.fit(X_train, y_train, epochs=10, verbose=2)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/models.py", line 867, in fit
    initial_epoch=initial_epoch)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/engine/training.py", line 1598, in fit
    validation_steps=validation_steps)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/engine/training.py", line 1183, in _fit_loop
    outs = f(ins_batch)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2273, in __call__
    **self.session_kwargs)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[27,2] = 1049 is not in [0, 1045)
     [[Node: embedding_1/GatherV2 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1/embeddings/read, embedding_1/Cast, embedding_1/GatherV2/axis)]]

Caused by op 'embedding_1/GatherV2', defined at:
  File "/Users/faysal/Desktop/My Computer/D/Code Workspace/Research-IoT/Collaboration/embedding-tut/src/main.py", line 359, in <module>
    custom_keras_model(embedding_matrix, model.wv)
  File "/Users/faysal/Desktop/My Computer/D/Code Workspace/Research-IoT/Collaboration/embedding-tut/src/main.py", line 278, in custom_keras_model
    new_model.add(embeddings)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/models.py", line 442, in add
    layer(x)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/engine/topology.py", line 602, in __call__
    output = self.call(inputs, **kwargs)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/layers/embeddings.py", line 134, in call
    out = K.gather(self.embeddings, inputs)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 1134, in gather
    return tf.gather(reference, indices)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 2736, in gather
    return gen_array_ops.gather_v2(params, indices, axis, name=name)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3065, in gather_v2
    "GatherV2", params=params, indices=indices, axis=axis, name=name)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/Users/faysal/anaconda2/envs/python3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): indices[27,2] = 1049 is not in [0, 1045)
     [[Node: embedding_1/GatherV2 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1/embeddings/read, embedding_1/Cast, embedding_1/GatherV2/axis)]]


Process finished with exit code 1

The error occurs when fitting the training data to the model. I think I have made a mistake in calculating the training data's shape and feeding it into the model.

Upvotes: 0

Views: 203

Answers (1)

ixeption

Reputation: 2050

You are using two different Tokenizers and fitting them separately on the train and test data. As a result, the token indices do not match between training and test. The error is raised because a token index (1049) occurs that lies outside the embedding matrix, which only has 1045 rows. Even if you fix that, the model will not work correctly as long as you keep two tokenizers.
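A quick sanity check makes the mismatch visible (a minimal sketch, assuming the padded X_train/X_test arrays and embedding_matrix from your question): the largest token index must stay below the number of rows in the embedding matrix, otherwise the embedding lookup fails exactly as in your traceback.

# every token index must be a valid row of embedding_matrix
print(X_train.max(), X_test.max())        # largest token indices after padding
print(embedding_matrix.shape[0])          # number of embedding rows (1045 here)
assert X_train.max() < embedding_matrix.shape[0]
assert X_test.max() < embedding_matrix.shape[0]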

What you should do is fit your Tokenizer on all of the data (X_train and X_test) and use just that one single Tokenizer, for example as in the sketch below.
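A minimal sketch of what that could look like, reusing the variable names from your question (X_train/X_test as raw texts, model as the trained gensim word2vec model, vector_dim as the embedding size). Building the embedding matrix from the tokenizer's own word_index keeps every token id inside the matrix:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

# one tokenizer, fitted on all texts
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train + X_test)

max_length = max(len(s.split()) for s in X_train)
X_train_pad = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=max_length, padding='post')
X_test_pad = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=max_length, padding='post')

# build the embedding matrix against the tokenizer's indices
# (index 0 is reserved for padding, hence the +1)
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, vector_dim))
for word, i in tokenizer.word_index.items():
    if word in model.wv.vocab:
        embedding_matrix[i] = model.wv[word]

This way the Embedding layer's input_dim (embedding_matrix.shape[0]) covers every index the tokenizer can produce, so the GatherV2 error above cannot occur.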

Upvotes: 1
