Reputation: 384
I tokenized my DataFrame's text content like so:
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(tweets_df['content'])
tweets_df['content'] = tokenizer.texts_to_sequences(tweets_df['content'])
Then I tried to pad the sequences:
X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train,
                                                        maxlen=MAX_LENGTH,
                                                        dtype='int32',
                                                        padding='post',
                                                        truncating='post')
This fails with: invalid literal for int() with base 10: 'content'
I tried to find the items that weren't integers:
for arr in X_test['content']:
    for num in arr:
        if not isinstance(num, int):
            print(num)
But this didn't return anything. What am I missing?
Upvotes: 1
Views: 560
Reputation:
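The error means that something which cannot be converted to an int is being passed where an integer is expected. The offending value in the message, 'content', is your column name, which suggests that the whole DataFrame, rather than the 'content' column, is being passed to pad_sequences. Please take a look at this sample working solution, which pads a single column: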
import pandas as pd
cars = {'Brand': [' Hero Honda Civic','Toyota Corolla','Ford Focus','Audi A4 A3 A2 A1']}
df = pd.DataFrame(cars, columns = ['Brand'])
# Tokenize the text
import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(df['Brand'])
df['Brand'] = tokenizer.texts_to_sequences(df['Brand'])
# Perform the padding
sequence = df['Brand']
MAX_LENGTH = 5
tf.keras.preprocessing.sequence.pad_sequences(sequence,
                                              maxlen=MAX_LENGTH,
                                              dtype='int32',
                                              padding='post',
                                              truncating='post')
array([[ 1, 2, 3, 0, 0],
[ 4, 5, 0, 0, 0],
[ 6, 7, 0, 0, 0],
[ 8, 9, 10, 11, 12]], dtype=int32)
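As for why the question's check printed nothing: it loops over X_test['content'], the column, whose items really are all integers. If X_train is a DataFrame, however, iterating over it yields the column labels instead of the rows, so pad_sequences would see the string 'content' as if it were a sequence. The sketch below uses only pandas to show the difference. The frame and its values are made up for illustration, and it assumes X_train is a DataFrame with a single 'content' column, which the question does not confirm.

```python
import pandas as pd

# Hypothetical stand-in for the tokenized frame from the question.
X_train = pd.DataFrame({'content': [[1, 2, 3], [4, 5], [6]]})

# Iterating over a DataFrame yields its column labels, not its rows.
items_from_frame = list(X_train)

# Iterating over one column (a Series) yields the token lists themselves.
items_from_column = list(X_train['content'])

print(items_from_frame)   # the column label
print(items_from_column)  # the tokenized sequences
```

Under that assumption, passing X_train['content'] rather than X_train to pad_sequences should avoid the error.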
Upvotes: 1