I am trying to implement a sequence model (trained to predict the next word) built on one-hot encoded vector sequences. My custom one-hot encoder works well, but as an exercise I want to do everything with TensorFlow. This is inspired by Deep Learning with Python, chapter 11, which I can reproduce, but not with my own data. One difference: the book uses a TensorFlow dataset, not a DataFrame. Suppose we have the input data 'df':
df = pd.DataFrame(data=[['I', 'have', 'an', 'idea','xyz'],
['This', 'idea', 'is', 'awesome', 'asd']],
columns=['cell_0_0','cell_0_1','cell_1_0','cell_1_1','next_word'])
First, the imports:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Input, Dense, LSTM, TextVectorization
from tensorflow.keras.optimizers import RMSprop
import numpy as np
import pandas as pd
Let's prepare data for the model.
Corpus = df.to_numpy().flatten().tolist()
Vocab = sorted(list(set(Corpus)))
Vocab.insert(0, '[UNK]')
Vocab.insert(0, '')
print(Vocab)
Output: ['', '[UNK]', 'I', 'This', 'an', 'asd', 'awesome', 'have', 'idea', 'is', 'xyz']
X = [[' '.join(Row[:-1])] for Row in df.to_numpy()]
Y = [[Row[-1]] for Row in df.to_numpy()]
X_vectorizer = TextVectorization(max_tokens=len(Vocab), output_mode="int", output_sequence_length=df.shape[1]-1, vocabulary=Vocab, standardize=None)
Y_vectorizer = TextVectorization(max_tokens=len(Vocab), output_mode="int", output_sequence_length=1, vocabulary=Vocab, standardize=None)
X_vectorized = X_vectorizer(X)
Y_vectorized = Y_vectorizer(Y)
print(X_vectorized.shape, Y_vectorized.shape)
Output: (2, 4) (2, 1)
print(X_vectorized[0], Y_vectorized[0])
Output: tf.Tensor([2 7 4 8], shape=(4,), dtype=int64) tf.Tensor([10], shape=(1,), dtype=int64)
So the data is correctly vectorized.
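For reference, the integer encoding above can be reproduced by hand. This is only a minimal sketch in plain Python of the lookup that the printed output implies (index 0 for padding, index 1 for unknown tokens). The helper `vectorize` and the example token 'pizza' are hypothetical and not part of TensorFlow or of my data.

```python
# Vocabulary copied from the printed output above.
vocab = ['', '[UNK]', 'I', 'This', 'an', 'asd', 'awesome',
         'have', 'idea', 'is', 'xyz']
token_to_index = {token: index for index, token in enumerate(vocab)}

PAD_INDEX = 0  # index of '' (padding)
UNK_INDEX = 1  # index of '[UNK]' (out-of-vocabulary tokens)


def vectorize(text, sequence_length):
    """Hypothetical helper: split on whitespace, look up indices, pad or truncate."""
    indices = [token_to_index.get(token, UNK_INDEX) for token in text.split()]
    indices = indices[:sequence_length]
    return indices + [PAD_INDEX] * (sequence_length - len(indices))


print(vectorize('I have an idea', 4))  # [2, 7, 4, 8]
print(vectorize('xyz', 1))             # [10]
print(vectorize('I have pizza', 4))    # [2, 7, 1, 0]
```

The first two results match the tensors printed above, so the targets are plain integer class indices of shape (batch, 1), not one-hot vectors.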
Then I build the model:
inputs = keras.Input(shape=(None,), dtype="int64", name="input_layer")
embedded = tf.one_hot(inputs, depth=len(Vocab))
layer1 = LSTM(32)(embedded)
outputs = Dense(len(Vocab), activation='softmax')(layer1)
model = keras.Model(inputs, outputs)
model.compile(loss='categorical_crossentropy', optimizer='RMSprop')
model.fit(
    x=X_vectorized.numpy(),
    y=Y_vectorized,
    batch_size=256,
    epochs=55,
)
Please correct the model and/or the parameters of the fit method. The current version raises ValueError: Shapes (None, 1) and (None, 11) are incompatible. Is this data flow correct, or is it the wrong approach? Should the data be vectorized and then one-hot encoded right in the model?
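To make the two shapes in the error concrete, here is a minimal sketch in plain Python (no TensorFlow needed). It assumes the mismatch comes from comparing an integer target of length 1 with a softmax output of length 11. The functions below are simplified stand-ins written for illustration, not the Keras implementations, and the uniform `prediction` is made-up data.

```python
import math

NUM_CLASSES = 11  # len(Vocab) in the question


def one_hot(index, depth):
    """Return a list of length `depth` with 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def categorical_crossentropy(one_hot_target, probabilities):
    """Simplified stand-in: the target must be as long as the prediction."""
    if len(one_hot_target) != len(probabilities):
        raise ValueError(
            f"Shapes ({len(one_hot_target)},) and "
            f"({len(probabilities)},) are incompatible")
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))


def sparse_categorical_crossentropy(class_index, probabilities):
    """Simplified stand-in: the target is a single integer class index."""
    return -math.log(probabilities[class_index])


# Stand-in for one softmax output row: uniform over the 11 classes.
prediction = [1.0 / NUM_CLASSES] * NUM_CLASSES
label = [10]  # same layout as one row of Y_vectorized

# Passing the integer label directly gives a 1-versus-11 mismatch.
try:
    categorical_crossentropy(label, prediction)
except ValueError as error:
    print(error)

# Either one-hot encode the label, or use the sparse variant of the loss.
dense_loss = categorical_crossentropy(one_hot(label[0], NUM_CLASSES), prediction)
sparse_loss = sparse_categorical_crossentropy(label[0], prediction)
print(math.isclose(dense_loss, sparse_loss))  # True
```

In Keras terms, this suggests two options: keep the integer targets and compile with loss='sparse_categorical_crossentropy', or one-hot encode Y to shape (batch, 11) and keep 'categorical_crossentropy'.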