x3mEr

Reputation: 33

How to vectorize text data in a pandas DataFrame and then one-hot encode it "inside" the model

I am trying to implement a sequence model (trained to predict the next word) built on sequences of one-hot encoded vectors. My custom one-hot encoder works well, but as an exercise I want to do everything with TensorFlow. This is inspired by Deep Learning with Python, chapter 11, which I can reproduce, but not with my own data. One difference: the book uses a tensorflow Dataset, not a DataFrame. Suppose the input data is 'df':

df = pd.DataFrame(data=[['I', 'have', 'an', 'idea','xyz'],
                         ['This', 'idea', 'is', 'awesome', 'asd']],
                   columns=['cell_0_0','cell_0_1','cell_1_0','cell_1_1','next_word'])

First, the imports (these need to run before 'df' is created):

import tensorflow as tf
from tensorflow import keras 
from tensorflow.keras.layers import Input, Dense, LSTM, TextVectorization
from tensorflow.keras.optimizers import RMSprop
import numpy as np
import pandas as pd

Let's prepare data for the model.

  1. Get text corpus and vocabulary:
Corpus = df.to_numpy().flatten().tolist()
Vocab  = sorted(list(set(Corpus)))
Vocab.insert(0, '[UNK]')
Vocab.insert(0, '')
print(Vocab)

Output: ['', '[UNK]', 'I', 'This', 'an', 'asd', 'awesome', 'have', 'idea', 'is', 'xyz']

  2. Split 'df' into X and Y. As I want to use TextVectorization, X should be a vector of text, i.e. the dimension of X should be 2 (or 1?):
X = [[' '.join(Row[:-1])] for Row in df.to_numpy()]
Y = [[Row[-1]] for Row in df.to_numpy()]
  3. Then vectorize the input data:
X_vectorizer = TextVectorization(max_tokens=len(Vocab), output_mode="int", output_sequence_length=df.shape[1]-1, vocabulary=Vocab, standardize=None)
Y_vectorizer = TextVectorization(max_tokens=len(Vocab), output_mode="int", output_sequence_length=1, vocabulary=Vocab, standardize=None)

X_vectorized = X_vectorizer(X)
Y_vectorized = Y_vectorizer(Y)
print(X_vectorized.shape, Y_vectorized.shape)

Output: (2, 4) (2, 1)

print(X_vectorized[0], Y_vectorized[0])

Output: tf.Tensor([2 7 4 8], shape=(4,), dtype=int64) tf.Tensor([10], shape=(1,), dtype=int64)

So the data is vectorized correctly.
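To make the integer output easier to read, here is a pure-Python sketch of roughly what the vectorization step does here. It assumes whitespace splitting, index 0 for padding and index 1 for '[UNK]'. The names `token_to_index` and `vectorize` are made up for illustration and are not part of TensorFlow.

```python
# Vocabulary as printed above: index 0 is padding, index 1 is the OOV token.
vocab = ['', '[UNK]', 'I', 'This', 'an', 'asd', 'awesome',
         'have', 'idea', 'is', 'xyz']
token_to_index = {token: i for i, token in enumerate(vocab)}


def vectorize(text, sequence_length):
    """Hypothetical helper: split on whitespace, look up each token,
    then truncate or zero-pad to a fixed length."""
    ids = [token_to_index.get(tok, 1) for tok in text.split()]  # 1 = '[UNK]'
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


x0 = vectorize('I have an idea', 4)
y0 = vectorize('xyz', 1)
print(x0, y0)  # [2, 7, 4, 8] [10]
```

These are the same indices as in the tensors printed above, so each target is a single integer class index, not a one-hot vector.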

Then I build the model:

inputs = keras.Input(shape=(None,), dtype="int64", name="input_layer")
embedded = tf.one_hot(inputs, depth=len(Vocab))
layer1 = LSTM(32)(embedded)
outputs = Dense(len(Vocab), activation='softmax')(layer1)

model = keras.Model(inputs, outputs)
model.compile(loss='categorical_crossentropy', optimizer='RMSprop')
model.fit(x=X_vectorized.numpy()
         ,y=Y_vectorized
         ,batch_size=256
         ,epochs=55
         )

Please correct the model and/or the parameters of the fit method, as the current version raises ValueError: Shapes (None, 1) and (None, 11) are incompatible. Is this kind of data flow correct, or is it the wrong approach? Should the data be vectorized and then one-hot encoded right in the model?
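One possible reading of the error: the Dense layer outputs 11 probabilities per sample, while Y_vectorized holds one integer index per sample, and 'categorical_crossentropy' expects one-hot targets with the same shape as the output. The pure-Python sketch below is only an illustration of that mismatch, not Keras code. The uniform predictions are made-up placeholder values, and the helper functions are hypothetical re-implementations of the two loss variants.

```python
import math

vocab_size = 11
# Placeholder softmax outputs for 2 samples: shape (2, 11).
probs = [[1.0 / vocab_size] * vocab_size for _ in range(2)]
# Integer targets as produced for 'xyz' and 'asd': shape (2, 1).
y_int = [[10], [5]]


def one_hot(index, depth):
    """Turn a class index into a one-hot list of length `depth`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def categorical_ce(y_true_one_hot, y_pred):
    """Cross-entropy for one-hot targets: shapes (N, C) and (N, C)."""
    return [-sum(t * math.log(p) for t, p in zip(t_row, p_row))
            for t_row, p_row in zip(y_true_one_hot, y_pred)]


def sparse_categorical_ce(y_true_int, y_pred):
    """Cross-entropy for integer targets: shapes (N, 1) and (N, C)."""
    return [-math.log(p_row[t_row[0]])
            for t_row, p_row in zip(y_true_int, y_pred)]


# Option A: one-hot encode the targets so they have shape (2, 11).
y_one_hot = [one_hot(row[0], vocab_size) for row in y_int]
loss_a = categorical_ce(y_one_hot, probs)

# Option B: keep the integer targets and use the sparse variant.
loss_b = sparse_categorical_ce(y_int, probs)

print(loss_a, loss_b)  # both variants give the same per-sample losses
```

In Keras terms this suggests two candidate fixes, neither of which I have verified on the data above: compile with loss='sparse_categorical_crossentropy' and keep the integer targets, or one-hot encode Y to depth len(Vocab) and keep 'categorical_crossentropy'.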

Upvotes: 0

Views: 270

Answers (0)
