Ziqi

Reputation: 2554

Understanding this DNN model and why it does not work on multi-label classification

I am fairly new to Keras and DNNs in general. Starting from some tutorials, I have managed to create a model for classifying sentences, shown below. To be honest, I do not know for sure what the intuition behind it is or why it works, so that is my question.

from keras.models import Sequential
from keras.layers import Embedding, Dropout, Conv1D, MaxPooling1D, LSTM, GlobalMaxPooling1D, Dense

def create_model():
    embedding_layer = Embedding(input_dim=100, output_dim=300,
                                input_length=100)
    model = Sequential()
    model.add(embedding_layer)
    model.add(Dropout(0.2))
    model.add(Conv1D(filters=100, kernel_size=4, padding='same', activation='relu'))
    model.add(MaxPooling1D(pool_size=4))
    model.add(LSTM(units=100, return_sequences=True))
    model.add(GlobalMaxPooling1D())
    #model.add(Dense(1, activation='sigmoid'))
    ###### multi-class classification #########
    model.add(Dense(3, activation='sigmoid'))  # I wanted to replace the line above with this for multi-class classification, but it didn't work
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

And here is my understanding: the model starts by training word embeddings on the corpus (of sentences) and represents each sentence as a sequence of word vectors (the embedding_layer). The dropout layer then forces the model not to rely on specific words. The convolution has a similar effect, identifying phrases/n-grams rather than just individual words; an LSTM then follows to learn sequences of phrases/n-grams that may be useful features; the GlobalMaxPooling1D layer then 'flattens' the LSTM output into features for the final classification (the dense layer).
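For reference, here is how I read the per-layer output shapes (my own sketch from calling model.summary() on the model above; None is the batch size):

    model = create_model()
    model.summary()
    # Embedding          -> (None, 100, 300)
    # Dropout            -> (None, 100, 300)
    # Conv1D             -> (None, 100, 100)  (padding='same' keeps length 100)
    # MaxPooling1D       -> (None, 25, 100)   (pool_size=4 divides the length by 4)
    # LSTM               -> (None, 25, 100)   (return_sequences=True keeps the 25 steps)
    # GlobalMaxPooling1D -> (None, 100)
    # Dense              -> (None, 3)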

Does this make any sense? I also do not quite understand the interaction between the MaxPooling1D layer and the LSTM layer. What is the input_shape to the LSTM, and what does its output look like?

Upvotes: 2

Views: 511

Answers (2)

Daniel Möller

Reputation: 86600

Multiclass models:

The multiclass model ending with Dense(3, activation='sigmoid') is OK for a multiclass problem with 3 possible classes.

But it should only use 'categorical_crossentropy' if there is exactly one correct class among the 3, and in that case the activation function should be 'softmax'.

  • A 'softmax' guarantees that all the classes sum to 1. It's good when you want exactly one correct class.
  • A 'sigmoid' does not care about the relation between the 3 classes; they can coexist as all ones or all zeros. In that case, use 'binary_crossentropy'.
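A minimal sketch of the two pairings (assuming the rest of the model from the question stays the same):

    # exactly one correct class out of 3: softmax + categorical_crossentropy
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    # classes can co-occur (multi-label): sigmoid + binary_crossentropy
    model.add(Dense(3, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])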

LSTM and GlobalMaxPooling:

The LSTM input is (batchSize, timeSteps, featuresOrDimension).
The output can have one of two shapes:

  • With return_sequences = True: (batchSize, timeSteps, units)
  • With return_sequences = False: (batchSize, units).

Since you chose the True case, the timeSteps dimension is kept, and GlobalMaxPooling1D will take the highest value along that dimension (for each unit) and discard the rest, resulting in (batchSize, units).

It's pretty much like using LSTM(units, return_sequences=False) on its own, except that the latter takes the last step in the sequence, while the max pooling takes the maximum value over all steps.
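As a quick check, a sketch assuming 100 units and the pooled sequence length of 25 from the question:

    from keras.models import Sequential
    from keras.layers import LSTM, GlobalMaxPooling1D

    m = Sequential()
    m.add(LSTM(units=100, return_sequences=True, input_shape=(25, 100)))  # -> (None, 25, 100)
    m.add(GlobalMaxPooling1D())                                           # -> (None, 100), max over the 25 steps per unit
    m.summary()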

Upvotes: 0

Marcin Możejko

Reputation: 40516

So, your intuition is right. Everything you said holds. About MaxPooling1D: it's a way to downsample the output from Conv1D. The output of this layer is 4 times shorter than the output from Conv1D (so the input to the LSTM will have a length of 25, with the same number of features). Just to show you how it works:

output from Conv1D :

0, 1, 1, 0, -1, 2, 3, 5, 1, 2, 1, -1

input to LSTM :

1 (max from 0, 1, 1, 0), 5 (max from -1, 2, 3, 5), 2 (max from 1, 2, 1, -1)
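You can reproduce this downsampling directly; a minimal sketch with NumPy, using the values from the example above:

    import numpy as np

    conv_output = np.array([0, 1, 1, 0, -1, 2, 3, 5, 1, 2, 1, -1])

    # MaxPooling1D(pool_size=4): max over each non-overlapping window of 4 steps
    pooled = conv_output.reshape(-1, 4).max(axis=1)
    print(pooled)  # [1 5 2]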

Edit: I hadn't noticed the categorical_crossentropy and the activations. So:

  1. If your output is exactly one out of 3 classes, you could use categorical_crossentropy with sigmoid, but then your output cannot be interpreted as a probability distribution, only as class scores (the prediction is the class with the highest score). The better option is to use softmax, which produces a probability distribution over the classes.

  2. In the case of predicting 3 classes that are not mutually exclusive, you should use binary_crossentropy due to the Keras implementation, even though it's mathematically equivalent to categorical_crossentropy. This is because Keras normalizes the outputs from the last layer, making them sum up to 1, which might seriously harm your training.

Upvotes: 2
