StackPancakes

Reputation: 284

How do I apply tokenizer.fit_on_texts() to a data frame with two columns of objects/strings that I need to train on?

I need to pass two sets of data into tokenizer.fit_on_texts(), but I'm having issues with it not recognizing the text: len(tokenizer.word_index) returns just 2. I suspect the issue is occurring at tokenizer.fit_on_texts(), as I am passing it a data frame of shape (33481, 2) of strings. Most of the examples I have looked at use the IMDB dataset.

Additional information: I'm currently experimenting with a multi-class classification problem where there are headline-article pairs with labels (agree, disagree, discuss, unrelated). I plan to use an LSTM and pre-trained GloVe embeddings to create an index of words mapped to known embeddings.


Data:

f_data -

Here is a snippet of the creation of f_data:

# This df will be fed into the fit_on_texts()
# Creating df to contain the train and validation set
f_data = pd.DataFrame(columns = ['Headline', 'articleBody'])
# Adding data from x_train to f_data
f_data['Headline'] = x_train['Headline']
f_data['articleBody'] = x_train['articleBody']
# Appending x_val headline and article body columns
f_data = f_data.append(x_val[['Headline', 'articleBody']])
f_data
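As an aside, DataFrame.append is deprecated in recent pandas releases; if that matters, the same frame could be built with pd.concat instead (a sketch using the same x_train/x_val):

import pandas as pd

# Build the combined frame without DataFrame.append (deprecated in newer pandas)
f_data = pd.concat(
    [x_train[['Headline', 'articleBody']], x_val[['Headline', 'articleBody']]],
    ignore_index=True,
)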

Keras/TF code Issue

Issue: when I print out the length of word_index, it returns 2:

tokenizer.fit_on_texts(f_data[['Headline', 'articleBody']])
sequences = tokenizer.texts_to_sequences(f_data[['Headline', 'articleBody']])
word_index = tokenizer.word_index
print('Vocab size:', len(word_index))
>> Vocab size: 2

data = pad_sequences(sequences, padding = 'post', maxlen = MAX_SEQ_LEN)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', y_train_cat.shape)

I have tried turning f_data into an ndarray, but I get an attribute error.

f_data_2 = np.array(f_data[['Headline', 'articleBody']]) # ndarray
sequences = tokenizer.texts_to_sequences(f_data_2)
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
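From what I can tell, the error happens because iterating a 2-D ndarray yields row arrays rather than plain strings, so the tokenizer's internal .lower() call fails. A minimal reproduction (assuming tensorflow.keras):

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

tok = Tokenizer()
tok.fit_on_texts(['a headline', 'an article body'])

rows = np.array([['a headline', 'an article body']])  # shape (1, 2), like f_data_2
# Each item handed to the tokenizer is a row (an ndarray), not a string:
tok.texts_to_sequences(rows)  # AttributeError: 'numpy.ndarray' object has no attribute 'lower'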

Any suggestions? I have looked at some other questions, but they are dealing with a list of strings.


Solution: I think I finally got something to work, but I'm not entirely sure this is correct.

f_data = np.c_[(np.array(f_data['Headline']), np.array(f_data['articleBody']))]
f_data = f_data.tolist()

....

sequences = tokenizer.texts_to_sequences(f_data)
word_index = tokenizer.word_index
print('Vocab size:', len(word_index))
print(word_index)
>> Vocab size: 3239
>> {...'judicial watch reported isis members crossed mexican border': 12,
   'isis beheads photojournalist james wright foley nasage us end intervention iraq': 13, ...} 
 # (there are 3239 strings)

Update 1:

The above is not a solution. It seems that my tokenized sentences are only recording two values and the rest are 0:

>Tokenized sentences: 
 [1174  102    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    ....
    0    0    0    0]
 >shape (200,)
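Digging a bit, I think this is because each element of f_data is now a list of [headline, body], and the Tokenizer treats a list "text" as already-split tokens, so every full headline or body string becomes a single vocabulary entry (and each sequence therefore has only two non-zero tokens). A small check of that behaviour, assuming tensorflow.keras:

from tensorflow.keras.preprocessing.text import Tokenizer

demo = Tokenizer()
demo.fit_on_texts([['Some full headline', 'Some full article body']])
print(demo.word_index)
# {'some full headline': 1, 'some full article body': 2}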

Edit 1:

f_head = np.array(f_data['Headline'].tolist())
f_body = np.array(f_data['articleBody'].tolist())            

#Head Tok, Seq., Pad
toke_head = Tokenizer(num_words=Max_Num_Wrd_Head)
toke_head.fit_on_texts(f_head)

seq_head = toke_head.texts_to_sequences(f_head)
wrd_indx_head = toke_head.word_index

data_Head = pad_sequences(seq_head, padding='post', maxlen=Max_Seq_Len_Head)

#Body Tok, Seq., Pad
toke_body = Tokenizer(num_words=MAX_NUM_WRDS_BODY)
toke_body.fit_on_texts(f_body)

seq_body = toke_body.texts_to_sequences(f_body)
wrd_indx_body = toke_body.word_index

data_Body = pad_sequences(seq_body, padding='post', maxlen=Max_Seq_Len_Body)
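With the per-column word indices above, the GloVe lookup I mentioned could be built roughly like this (a sketch; the GloVe file name and embedding dimension are assumptions, and the same would be repeated for the body):

import numpy as np

EMB_DIM = 100  # assumed; must match the GloVe file used
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:  # assumed file name
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Rows stay all-zero for words GloVe does not cover
emb_matrix_head = np.zeros((len(wrd_indx_head) + 1, EMB_DIM))
for word, i in wrd_indx_head.items():
    vec = embeddings_index.get(word)
    if vec is not None:
        emb_matrix_head[i] = vec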

Upvotes: 2

Views: 3020

Answers (1)

karthik_ghorpade

Reputation: 374

You could tokenize the two columns separately, feed them in through two different input layers, concatenate them, and pass the result into the LSTM layer, right? If this approach works for you, I could explain how to do that.

Edit: If you are comfortable using the Functional API, generate the two padded sequence inputs corresponding to the two columns as follows:

tokenizer.fit_on_texts(f_data['Headline'])
vocab_size = len(tokenizer.word_index) + 1

headline_sequences_train = tokenizer.texts_to_sequences(f_data['Headline'])
#headline_seq_validation = tokenizer.texts_to_sequences(val_data['Headline'])

headline_padded_train = pad_sequences(headline_sequences_train, padding='post', maxlen = MAX_SEQ_LEN)
#headline_padded_validation = pad_sequences(headline_seq_validation,padding = 'post',maxlen = MAX_SEQ_LEN)

Similarly for article body:

tokenizer.fit_on_texts(f_data['articleBody'])
vocab_size = len(tokenizer.word_index) + 1

art_body_seq_train = tokenizer.texts_to_sequences(f_data['articleBody'])
#art_body_seq_validation = tokenizer.texts_to_sequences(val_data['articleBody'])

art_body_padded_train = pad_sequences(art_body_seq_train, padding='post', maxlen = MAX_SEQ_LEN)
#art_body_padded_validation = pad_sequences(art_body_seq_validation, padding='post', maxlen = MAX_SEQ_LEN)

Note: MAX_SEQ_LEN may be different for the two columns; it depends on your preference. I'd suggest you analyse the word lengths of the Headline and articleBody columns separately and select different maximum sequence lengths that seem suitable.
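For instance, a quick way to pick maxlen values could be to look at length percentiles of each column (a sketch, reusing f_data from the question):

import numpy as np

head_lens = f_data['Headline'].str.split().str.len()
body_lens = f_data['articleBody'].str.split().str.len()

print('Headline 95th percentile length:', int(np.percentile(head_lens, 95)))
print('Body 95th percentile length:', int(np.percentile(body_lens, 95)))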

headline_padded_train and art_body_padded_train are your two inputs corresponding to the two input layers in your neural network.
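For reference, one possible reading of the model described above (embed each input, concatenate, then a single LSTM) could look like the sketch below; layer sizes, the constant names, and the 4-class output are assumptions, not a prescribed architecture:

from tensorflow.keras.layers import Input, Embedding, Concatenate, LSTM, Dense
from tensorflow.keras.models import Model

MAX_SEQ_LEN_HEAD = 200   # assumed; whatever maxlen was used for headlines
MAX_SEQ_LEN_BODY = 200   # assumed; whatever maxlen was used for bodies
EMB_DIM = 100            # assumed embedding size

headline_in = Input(shape=(MAX_SEQ_LEN_HEAD,))
body_in = Input(shape=(MAX_SEQ_LEN_BODY,))

# One Embedding per branch; GloVe weights could be passed via the `weights` argument
headline_emb = Embedding(vocab_size, EMB_DIM)(headline_in)
body_emb = Embedding(vocab_size, EMB_DIM)(body_in)

# Concatenate along the time axis, then feed the merged sequence to one LSTM
merged = Concatenate(axis=1)([headline_emb, body_emb])
lstm_out = LSTM(64)(merged)
output = Dense(4, activation='softmax')(lstm_out)  # agree / disagree / discuss / unrelated

model = Model(inputs=[headline_in, body_in], outputs=output)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit([headline_padded_train, art_body_padded_train], y_train_cat, epochs=5)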

Upvotes: 3
