Fiong

Reputation: 151

How to deal with length variations for text classification using CNN (Keras)

It has been shown that CNNs (convolutional neural networks) are quite useful for text/document classification. I wonder how to deal with the length differences, since articles have different lengths in most cases. Are there any examples in Keras? Thanks!!

Upvotes: 11

Views: 7504

Answers (4)

janst

Reputation: 142

I just made a model in Keras using its LSTM RNN layer. It forced me to pad my inputs (i.e. the sentences). I simply appended empty strings to each sentence until it reached the desired length, e.g. the length of the longest feature (in words). Then I was able to use GloVe to transform my features into vector space before running them through my model.

import numpy as np

def getWordVectors(X):
    global num_words_kept
    global word2vec
    global word_vec_dim

    input_vector = []
    for row in X:
        words = row.split()
        # Truncate sentences that are longer than num_words_kept...
        if len(words) > num_words_kept:
            words = words[:num_words_kept]
        # ...and pad shorter ones with empty strings.
        elif len(words) < num_words_kept:
            words += [""] * (num_words_kept - len(words))
        input_to_vector = []
        for word in words:
            if word in word2vec:
                # Multidimensional word vector for a known word.
                input_to_vector.append(np.asarray(word2vec[word], dtype=float).tolist())
            else:
                # A value far from the rest, so unknown words are not too similar to real ones.
                input_to_vector.append([5.0] * word_vec_dim)
        input_vector.append(input_to_vector)
    return np.array(input_vector)

Where X is the list of sentences, this function will return an array of word vectors (looked up with GloVe) of length num_words_kept for each sentence. So I am using both padding and truncating. (Padding for the Keras implementation, and truncating because Keras also has issues when there are vast differences in the sizes of your inputs... I'm not entirely sure why. I had issues when I started padding some sentences with more than 100 empty strings.)

from keras.utils import to_categorical

X = getWordVectors(features)
y = to_categorical(y)  # one-hot encode labels for categorical_crossentropy
model.fit(X, y, batch_size=16, epochs=5, shuffle=False)

Keras requires numpy arrays before you feed your data in, so both my features and my labels are numpy arrays.
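For reference, Keras also ships a helper that does this kind of truncating and padding in one call when you work with integer token ids instead of raw strings; a minimal sketch (the example sequences and maxlen are illustrative):

from keras.preprocessing.sequence import pad_sequences

# Token-id sequences of different lengths.
sequences = [[4, 10, 2], [7, 1], [3, 3, 9, 12, 5]]

# maxlen both truncates longer sequences and zero-pads shorter ones,
# mirroring the truncate/pad logic in getWordVectors above.
X = pad_sequences(sequences, maxlen=4, padding='post', truncating='post')
print(X.shape)  # (3, 4)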

Upvotes: 0

a11apurva

Reputation: 138

One possible solution is to send your sequences in batches of 1.

n_batch = 1
model.fit(X, y, epochs=1, batch_size=n_batch, verbose=1, shuffle=False)
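With a batch size of 1, each batch can have its own length, so the model does not need a fixed input length at all. A minimal sketch of what that can look like, assuming X is a list of variable-length token-id sequences and y holds matching one-hot labels (layer sizes are illustrative):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, GlobalMaxPooling1D, Dense

# No input_length on the Embedding, so each batch may have a different length.
model = Sequential([
    Embedding(input_dim=10000, output_dim=50),
    GlobalMaxPooling1D(),  # collapses the time axis whatever its length
    Dense(2, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Train one sample at a time; sequences need not share a length.
for seq, label in zip(X, y):
    model.train_on_batch(np.array([seq]), np.array([label]))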

This issue on the official Keras repo gives good insight and a possible solution: https://github.com/keras-team/keras/issues/85

Quoting patyork's comment:

There are two simple and most often implemented ways of handling this:

  1. Bucketing and Padding

Separate input samples into buckets that have similar lengths, ideally such that each bucket has a number of samples that is a multiple of the mini-batch size. For each bucket, pad the samples to the length of the longest sample in that bucket with a neutral value. Zeros are frequent, but for something like speech data a representation of silence is used, which is often not zeros (e.g. the FFT of a silent portion of audio is used as neutral padding).

  2. Bucketing

Separate input samples into buckets of exactly the same length. This removes the need to determine what a neutral padding is; however, the size of the buckets in this case will frequently not be a multiple of the mini-batch size, so in each epoch, multiple updates will not be based on a full mini-batch.
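A rough sketch of the bucketing-and-padding idea from the quote above (the bucket boundaries and the fallback truncation are assumptions, not part of the original comment):

import numpy as np
from keras.preprocessing.sequence import pad_sequences

def make_buckets(sequences, labels, boundaries=(16, 32, 64, 128)):
    """Group sequences by length, then pad each bucket only to its own max length."""
    buckets = {b: ([], []) for b in boundaries}
    for seq, label in zip(sequences, labels):
        # Pick the smallest bucket the sequence fits in; anything longer
        # than the last boundary is truncated into it.
        b = next((b for b in boundaries if len(seq) <= b), boundaries[-1])
        buckets[b][0].append(seq[:b])
        buckets[b][1].append(label)
    # Each bucket becomes one padded array, with far less padding than
    # padding everything to one global maximum length.
    return [(pad_sequences(seqs, padding='post'), np.array(ys))
            for seqs, ys in buckets.values() if seqs]

# Then fit on each bucket in turn, e.g.:
# for Xb, yb in make_buckets(sequences, labels):
#     model.fit(Xb, yb, batch_size=16, epochs=1)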

Upvotes: 2

pedrobisp

Reputation: 717

You can see a concrete example here: https://github.com/fchollet/keras/blob/master/examples/imdb_cnn.py
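The gist of that example is to pad (or truncate) every review to one fixed maxlen and then run a 1D convolution over the embedded sequence. A condensed sketch of the same approach, not a copy of the file (hyperparameters are illustrative, and sequences is assumed to hold token-id lists):

from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense
from keras.preprocessing.sequence import pad_sequences

maxlen = 400
X = pad_sequences(sequences, maxlen=maxlen)  # every sample now has length 400

model = Sequential([
    Embedding(input_dim=5000, output_dim=50, input_length=maxlen),
    Conv1D(filters=250, kernel_size=3, activation='relu'),
    GlobalMaxPooling1D(),
    Dense(1, activation='sigmoid'),  # binary sentiment, as in the IMDB example
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])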

Upvotes: 3

1''

Reputation: 27105

Here are three options:

  1. Crop the longer articles.
  2. Pad the shorter articles.
  3. Use a recurrent neural network, which naturally supports variable-length inputs (see the sketch after this list).
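For option 3, one common pattern is to pad for batching anyway but let the network ignore the padding via masking; a minimal sketch (layer sizes are illustrative, and sequences is assumed to hold token-id lists):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.preprocessing.sequence import pad_sequences

# Pad so samples can be batched, but mask_zero=True makes the LSTM skip
# the padded steps, so each sequence is processed at its own length.
X = pad_sequences(sequences, padding='post')
model = Sequential([
    Embedding(input_dim=10000, output_dim=50, mask_zero=True),
    LSTM(64),
    Dense(3, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam')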

Upvotes: 3
