Eka

Reputation: 15000

What does Keras pad_sequences do?

I am trying to code a text char-rnn in Keras. For that purpose I have to first convert the text to a sequence and then pad the sequence, but I am having a lot of trouble implementing this step. I believe it's because of my skewed or lacking understanding of this function (pad_sequences) itself. I tried to google it and didn't find any good tutorial, and there is not much explained in the Keras docs either.

Can anyone tell me what pad_sequences is and how it works? Why should we pad the sequence (here at the character level) before feeding it to the network?

Please consider this text as an example:

Take the 50-year-old man diagnosed with prostate cancer in my clinic at Brigham and Women's Hospital in Boston. He received a novel procedure to remove his prostate, and later received focused radiation to try to eradicate any remaining cancer. Unfortunately, his disease returned a year later. But after two new therapies, his cancer now appears in check. And if his cancer does spread, a host of other treatments — including many not even on the market yet — may put his cancer back in remission.

Upvotes: 3

Views: 5786

Answers (1)

Nassim Ben

Reputation: 11553

The way we train RNNs is to feed them a series of sequences.

RNNs have well-known issues with backpropagation of the gradient (see Bengio et al.). This is the reason why we usually feed limited-length sequences to the RNN to train it. So in your example, you should cut the text into smaller pieces (sentences?) in order to build your training set.

For the sake of simplicity of implementation, Keras only accepts sequences of the same length within a batch (see "Recurrent Models with sequences of mixed length"). So if your sequences don't all have the same length, this is where pad_sequences is useful.

pad_sequences takes a LIST of sequences as input (a list of lists) and returns a 2D array of padded sequences.
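A minimal sketch, assuming the pad_sequences helper from keras.preprocessing.sequence (the toy sequences below are made up for illustration):

```python
from keras.preprocessing.sequence import pad_sequences

sequences = [[1, 2, 3],
             [4, 5],
             [6]]

# By default, shorter sequences are pre-padded with 0 up to the length
# of the longest one, and a 2D numpy array is returned.
print(pad_sequences(sequences))
# [[1 2 3]
#  [0 4 5]
#  [0 0 6]]

# maxlen, padding and value control the target length, the side on which
# padding is added, and the padding symbol.
print(pad_sequences(sequences, maxlen=4, padding='post', value=0))
# [[1 2 3 0]
#  [4 5 0 0]
#  [6 0 0 0]]
```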

To get your example to work, you will have to somehow cut the text into sequences of chars. To do that you can pick a separator of your choice ('.'?) and then pad all the resulting sentences to the same length. Or, smarter in my opinion, treat the whole text as one sequence of chars (including spaces and \n), cut it every n chars, and feed this list of sequences as training data, as sketched below. This way you avoid padding entirely, except for the last sequence (if the number of chars in your data is not a multiple of your sequence length n).
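A small sketch of that "cut every n chars" approach (the variable names and the value of n are just illustrative):

```python
# Cut the raw text into fixed-length character sequences;
# seq_len (the "n" above) is an arbitrary choice here.
text = "Take the 50-year-old man diagnosed with prostate cancer in my clinic ..."
seq_len = 40

# Consecutive, non-overlapping slices of seq_len characters.
chunks = [text[i:i + seq_len] for i in range(0, len(text), seq_len)]

# Every chunk except possibly the last one already has the same length,
# so only the final chunk would ever need padding.
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```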

Of course, don't forget to tokenize your characters and embed them in a vector space before feeding them into the RNN; the RNN won't work on raw categorical data.
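For that step, a rough sketch could look like the following (the toy corpus, layer sizes and the plain character-to-index mapping are my own choices for illustration, not a prescribed recipe):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

text = "hello world, hello rnn"                 # toy corpus, for illustration only
seq_len = 8
chunks = [text[i:i + seq_len] for i in range(0, len(text), seq_len)]

chars = sorted(set(text))                        # character vocabulary
char_to_idx = {c: i + 1 for i, c in enumerate(chars)}  # reserve 0 for padding

# Encode each full-length chunk as a sequence of integer ids.
encoded = np.array([[char_to_idx[c] for c in chunk] for chunk in chunks[:-1]])

model = Sequential()
model.add(Embedding(input_dim=len(chars) + 1, output_dim=16,
                    input_length=seq_len))       # char ids -> dense vectors
model.add(LSTM(32))
model.add(Dense(len(chars) + 1, activation='softmax'))  # next-char distribution
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# `encoded` (shape: num_sequences x seq_len) plus the matching next-char targets
# would then be passed to model.fit.
```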

Upvotes: 5
