Ziqi

Reputation: 2554

Understanding this DNN model and why it does not work on multi-label classification

I am fairly new to Keras and DNNs in general. Starting from some tutorials, I have managed to create a model for classifying sentences, shown below. To be honest, I do not know for sure what the intuition behind it is or why it works, so that is my question.

from keras.models import Sequential
from keras.layers import Embedding, Dropout, Conv1D, MaxPooling1D, LSTM, GlobalMaxPooling1D, Dense

def create_model():
    embedding_layer = Embedding(input_dim=100, output_dim=300,
                                input_length=100)
    model = Sequential()
    model.add(embedding_layer)
    model.add(Dropout(0.2))
    model.add(Conv1D(filters=100, kernel_size=4, padding='same', activation='relu'))
    model.add(MaxPooling1D(pool_size=4))
    model.add(LSTM(units=100, return_sequences=True))
    model.add(GlobalMaxPooling1D())
    #model.add(Dense(1, activation='sigmoid'))
    ###### multi-class classification #########
    model.add(Dense(3, activation='sigmoid'))  # I wanted to replace the line above with this for multi-class classification, but it didn't work
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

And here is my understanding: the model starts by training word embeddings on the corpus (of sentences) and represents each sentence as a sequence of word vectors (the embedding_layer). The dropout layer then forces the model not to rely on specific words. The convolution has a similar effect, identifying phrases/n-grams rather than just individual words; an LSTM then follows to learn sequences of phrases/n-grams that may be useful features; the GlobalMaxPooling1D layer then 'flattens' the LSTM output into features for the final classification (the dense layer).
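For reference, here is how I read the per-layer output shapes (my own sketch from calling model.summary() on the model above; None is the batch size):

    model = create_model()
    model.summary()
    # Embedding          -> (None, 100, 300)
    # Dropout            -> (None, 100, 300)
    # Conv1D             -> (None, 100, 100)  (padding='same' keeps length 100)
    # MaxPooling1D       -> (None, 25, 100)   (pool_size=4 divides the length by 4)
    # LSTM               -> (None, 25, 100)   (return_sequences=True keeps the 25 steps)
    # GlobalMaxPooling1D -> (None, 100)
    # Dense              -> (None, 3)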

Does this make any sense? I also do not quite understand the interaction between the MaxPooling1D layer and the LSTM layer. What is the input_shape to the LSTM, and what does its output look like?

Upvotes: 2

Views: 511

Answers (2)

Daniel Möller

Reputation: 86600

Multiclass models:

The multiclass model ending with Dense(3, activation='sigmoid') is OK for a multiclass problem with 3 possible classes.

But it should only use 'categorical_crossentropy' if there is exactly one correct class among the 3, and in that case the activation function should be 'softmax'.

  • A 'softmax' guarantees that all the classes sum to 1. It's good when you want exactly one correct class.
  • A 'sigmoid' does not care about the relation between the 3 classes; they can coexist as all ones or all zeros. In that case, use 'binary_crossentropy'.
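A minimal sketch of the two pairings (assuming the rest of the model from the question stays the same):

    # exactly one correct class out of 3: softmax + categorical_crossentropy
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    # classes can co-occur (multi-label): sigmoid + binary_crossentropy
    model.add(Dense(3, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])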

LSTM and GlobalMaxPooling:

The LSTM input is (batchSize, timeSteps, featuresOrDimension).
The output can have one of two shapes:

  • With return_sequences = True: (batchSize, timeSteps, units)
  • With return_sequences = False: (batchSize, units).

Since you chose the True case, the timeSteps dimension is kept, and GlobalMaxPooling1D will take the highest value along that dimension (for each unit) and discard the rest, resulting in (batchSize, units).

It's pretty much like using LSTM(units, return_sequences=False) on its own, except that the latter takes the last step in the sequence, while the max pooling takes the maximum value over all steps.
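As a quick check, a sketch assuming 100 units and the pooled sequence length of 25 from the question:

    from keras.models import Sequential
    from keras.layers import LSTM, GlobalMaxPooling1D

    m = Sequential()
    m.add(LSTM(units=100, return_sequences=True, input_shape=(25, 100)))  # -> (None, 25, 100)
    m.add(GlobalMaxPooling1D())                                           # -> (None, 100), max over the 25 steps per unit
    m.summary()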

Upvotes: 0

Marcin Możejko

Reputation: 40516

So, your intuition is right. Everything you said holds. About MaxPooling1D: it's a way to downsample the output from Conv1D. The output of this layer is 4 times shorter than the output from Conv1D (so the input to the LSTM will have a length of 25, with the same number of features). Just to show you how it works:

output from Conv1D :

0, 1, 1, 0, -1, 2, 3, 5, 1, 2, 1, -1

input to LSTM :

1 (max from 0, 1, 1, 0), 5 (max from -1, 2, 3, 5), 2 (max from 1, 2, 1, -1)
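You can reproduce this downsampling directly; a minimal sketch with NumPy, using the values from the example above:

    import numpy as np

    conv_output = np.array([0, 1, 1, 0, -1, 2, 3, 5, 1, 2, 1, -1])

    # MaxPooling1D(pool_size=4): max over each non-overlapping window of 4 steps
    pooled = conv_output.reshape(-1, 4).max(axis=1)
    print(pooled)  # [1 5 2]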

Edit: I hadn't noticed the categorical_crossentropy and the activations. So:

  1. If your output is exactly one out of 3 classes, you could use categorical_crossentropy with sigmoid, but then your output cannot be interpreted as a probability distribution, only as class scores (the prediction is the class with the highest score). The better option is to use softmax, which produces a probability distribution over the classes.

  2. In the case of predicting 3 classes that are not mutually exclusive, you should use binary_crossentropy due to the Keras implementation, even though it's mathematically equivalent to categorical_crossentropy. This is because Keras normalizes the outputs from the last layer, making them sum up to 1, which might seriously harm your training.

Upvotes: 2
