Reputation: 965
I have the following function, which adds a new column to my DataFrame. I want to use the vectorized text as input to my RNN, but I am not able to reshape the column into the required input shape. How can I resolve this? Thanks.
# vectorization
max_length = 500
def vectorization(text):
    seq = text.split()
    if seq:
        vectorizer = TfidfVectorizer()
        vectorizer.fit(seq)
        vector = vectorizer.transform(seq)
        return sequence.pad_sequences(vector.toarray(), maxlen=max_length)
    else:
        print(seq)
        return seq
df['text_vector']=df['text_cleaned'].apply(vectorization)
X_train, X_test, Y_train, Y_test = train_test_split(df['text_vector'], df['sentiment'], train_size=0.80, shuffle=True)
X_train = X_train.to_numpy()
X_test = X_test.to_numpy()
Y_train = Y_train.to_numpy()
Y_test = Y_test.to_numpy()
X_train = X_train.reshape((X_train.shape[0], 500, 1))
Error here:
ValueError: cannot reshape array of size 3876 into shape (3876,500,1)
Upvotes: 0
Views: 344
Reputation: 16856
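Few points:
- Fit TfidfVectorizer on the full training text, not per row as you are doing.
- pad_sequences returns a separate array for each row, so df['text_vector'] is a column of arrays rather than one numeric array. That is why the reshape fails: the column itself holds only 3876 elements. You will have to concatenate all the arrays row-wise to create a single array of shape (n, 500), where n is len(df).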
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from keras.preprocessing import sequence

max_length = 500

def vectorization(vectorizer, text):
    # text is a list of documents; the result has shape (len(text), max_length)
    vector = vectorizer.transform(text)
    # dtype='float32' keeps the TF-IDF weights; the default int32 would truncate them to 0
    return sequence.pad_sequences(vector.toarray(), maxlen=max_length, dtype='float32')

df = pd.DataFrame({'text_cleaned': [
    'a cat on a table',
    'a dog under a table',
    'apple is red',
    'sky is blue']})

v = TfidfVectorizer()
# Fit on the full training text
v.fit(df['text_cleaned'])
df['text_vector'] = df['text_cleaned'].apply(lambda text: vectorization(v, [text]))

# Concatenate all the 500-length sequences into one (n, 500) array
x_train = np.concatenate(df['text_vector'].tolist())
# Reshape (or use np.expand_dims) to add the last dimension so that it can be passed to the RNN
x_train = x_train.reshape(-1, 500, 1)
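To see why the concatenation step matters, here is a minimal NumPy-only sketch. It assumes each DataFrame row holds a (1, 500) array, as in the code above; the names column and n_rows and the filler values are hypothetical stand-ins for df['text_vector'].to_numpy(), not real TF-IDF output.

```python
import numpy as np

max_length = 500
n_rows = 4  # hypothetical number of DataFrame rows

# Stand-in for df['text_vector'].to_numpy(): an object array whose
# elements are themselves arrays of shape (1, max_length).
column = np.empty(n_rows, dtype=object)
for i in range(n_rows):
    column[i] = np.full((1, max_length), float(i), dtype="float32")

# The outer array has only n_rows elements, so it cannot be reshaped
# into (n_rows, max_length, 1). This reproduces the error in the question.
try:
    column.reshape((n_rows, max_length, 1))
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print(err)

# Concatenating row-wise builds one numeric array of shape (n_rows, max_length).
x = np.concatenate(list(column))

# Two equivalent ways to add the trailing feature dimension for the RNN.
x_reshaped = x.reshape(-1, max_length, 1)
x_expanded = np.expand_dims(x, axis=-1)
print(x_reshaped.shape)
```

Both reshape(-1, max_length, 1) and np.expand_dims(x, axis=-1) give the same (samples, timesteps, features) layout, so which one to use is a matter of taste.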
Upvotes: 1