Tom

Reputation: 753

How to load a text dataset (question and answer) into a numpy array for training a Keras model

I have a dataset with questions and answers in this form:

[question...]?\t[answer...].

Example:

Do you like pizza?     Yes its delicious.
...                    

Now I want to train a Keras model with it. But when I load the data I can't turn it into a numpy array, because the sentences do not all have the same length.

In input_text and out_text I stored the questions and answers as lists of split words, like this:

[["Do", "you", "like", "pizza", "?"] 
 [ ... ]]
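
For context, I build these lists from the file roughly like this (a simplified sketch; "qa.txt" is just a placeholder name, and my real tokenization also separates the punctuation into its own tokens):

input_text = []
out_text = []
with open("qa.txt", encoding="utf-8") as f:
    for line in f:
        # Each line has the form "[question...]?\t[answer...]."
        question, answer = line.rstrip("\n").split("\t")
        input_text.append(question.split())
        out_text.append(answer.split())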

Here is part of my code. (I also turn the words into 300-dimensional vectors with a self-made function, wordtovec.)

import numpy as np

X_data = []
Y_data = []

for i in range(len(input_text)):
    # Question word vectors, then zero vectors covering the answer part
    xdata = [wordtovec(word, wrdvecdic) for word in input_text[i]]
    # Zero vectors covering the question part, then the answer word vectors
    ydata = [[0] * 300 for _ in input_text[i]]

    # Separator between question and answer
    xdata.append([0] * 300)
    ydata.append([0] * 300)

    ydata = ydata + [wordtovec(word, wrdvecdic) for word in out_text[i]]
    xdata = xdata + [[0] * 300 for _ in out_text[i]]

    X_data.append(xdata)
    Y_data.append(ydata)

# This fails: the inner lists have a different length for every example
X_data = np.array(X_data)
Y_data = np.array(Y_data)

Maybe someone can show me how to do this, or link to an example of a similar dataset and how to load it into a numpy array for Keras.

Thanks for your responses.

Upvotes: 1

Views: 516

Answers (1)

Amir

Reputation: 16597

I'm not aware of a tutorial specifically on question answering, but there is a nice tutorial on a related problem on the official TensorFlow website.

Since all training sequences must have the same length, we usually use a padding function to standardize their lengths, e.g.:

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

question = ['thi is test', 'this is another test 2']
answers = ['i am answer', 'second answer is here']

# Build one shared vocabulary over questions and answers
tknizer = Tokenizer()
tknizer.fit_on_texts(question + answers)

# Convert the texts to integer sequences, then pad/truncate to length 20
question = tknizer.texts_to_sequences(question)
answers = tknizer.texts_to_sequences(answers)
question = pad_sequences(question, value=0, padding='post', maxlen=20)
answers = pad_sequences(answers, value=0, padding='post', maxlen=20)

print(question)

output:

[[4 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [5 1 6 2 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

In the above example, we assume the maximum length is 20. Sequences longer than 20 are truncated to fit the desired length, while sequences shorter than 20 are padded with 0 at the end.
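
Note that pad_sequences as used above operates on integer token sequences. If, as in your code, each word is already a 300-dimensional vector, you can instead pad the ragged lists manually with NumPy (a minimal sketch; the helper name pad_vector_sequences and the maxlen of 20 are just illustrative):

import numpy as np

def pad_vector_sequences(seqs, maxlen, dim=300):
    # Allocate one zero-filled array and copy each (possibly truncated)
    # sequence in, so every example ends up with shape (maxlen, dim).
    out = np.zeros((len(seqs), maxlen, dim), dtype=np.float32)
    for i, seq in enumerate(seqs):
        trunc = seq[:maxlen]
        out[i, :len(trunc)] = trunc
    return out

X = pad_vector_sequences(X_data, maxlen=20)  # shape (num_examples, 20, 300)
Y = pad_vector_sequences(Y_data, maxlen=20)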

Now, you can feed your preprocessed data into Keras:

from tensorflow.keras.layers import Input, Embedding, Dense
from tensorflow.keras.models import Model

inps1 = Input(shape=(20,))
inps2 = Input(shape=(20,))
# Shared embedding layer for questions and answers
embedding = Embedding(10000, 100)
emb1 = embedding(inps1)
emb2 = embedding(inps2)
# ... rest of the network
pred = Dense(100, activation='softmax')(prev_layer)
model = Model(inputs=[inps1, inps2], outputs=pred)
model.compile('adam', 'categorical_crossentropy')
model.fit([question, answers], labels, batch_size=128, epochs=100)
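
Here prev_layer and labels are placeholders: prev_layer stands for the output of whatever layers you put between the embeddings and the final Dense, and with categorical_crossentropy the labels need to be one-hot encoded with shape (num_samples, 100) to match the Dense(100) output.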

Upvotes: 1
