omri

Reputation: 384

Keras pad_sequences failing even after tokenizing

I tokenized my DataFrame's text content like so:

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(tweets_df['content'])
tweets_df['content'] = tokenizer.texts_to_sequences(tweets_df['content'])

Then I tried to pad the sequences:

X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train,
                                                             maxlen=MAX_LENGTH,
                                                             dtype='int32',
                                                             padding='post',
                                                             truncating='post')

Fails with: invalid literal for int() with base 10: 'content'

I tried to find the items that weren't integers:

for arr in X_test['content']:
  for num in arr:
    if not isinstance(num, int):
      print(num)

But this didn't print anything. What am I missing?

Upvotes: 1

Views: 560

Answers (1)

user11530462

It looks like the error occurs because something that cannot be converted to an integer is being passed to int(). Please take a look at this sample working solution:

import pandas as pd
cars = {'Brand': [' Hero Honda Civic','Toyota Corolla','Ford Focus','Audi A4 A3 A2 A1']}
df = pd.DataFrame(cars, columns = ['Brand'])

# Tokenize the text
import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(df['Brand'])
df['Brand'] = tokenizer.texts_to_sequences(df['Brand'])

Perform the padding:

sequence = df['Brand']
MAX_LENGTH = 5
tf.keras.preprocessing.sequence.pad_sequences(sequence,
                                              maxlen=MAX_LENGTH,
                                              dtype='int32',
                                              padding='post',
                                              truncating='post')

array([[ 1,  2,  3,  0,  0],
       [ 4,  5,  0,  0,  0],
       [ 6,  7,  0,  0,  0],
       [ 8,  9, 10, 11, 12]], dtype=int32)
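
A likely explanation for the exact message in the question, assuming X_train there is still a whole DataFrame rather than its 'content' column: pad_sequences iterates over its argument, and iterating a DataFrame yields the column labels, so the string 'content' is what reaches the integer conversion. The sketch below uses a made-up two-row frame (the data is hypothetical, not taken from the question) and needs only pandas:

```python
import pandas as pd

# Hypothetical stand-in for the asker's data: one column of token-id lists.
X_train = pd.DataFrame({'content': [[1, 2, 3], [4, 5]]})

# Iterating a DataFrame yields its column labels, not its rows.
items_from_frame = list(X_train)

# Iterating the column (a Series) yields the token-id lists themselves.
items_from_column = list(X_train['content'])

print(items_from_frame)    # the column label(s)
print(items_from_column)   # the per-row token-id lists
```

If that assumption holds, passing X_train['content'] to pad_sequences instead of X_train should avoid the error, which is what the working example above does with df['Brand']. It would also explain why the check loop in the question printed nothing: it inspects X_test['content'], whose items are already integers.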

Upvotes: 1
