ababuji

Reputation: 1731

Understanding word embeddings, convolutional layer and max pooling layer in LSTMs and RNNs for NLP Text Classification

Here is my input data:

data['text'].head()

0    process however afforded means ascertaining di...
1          never occurred fumbling might mere mistake 
2    left hand gold snuff box which capered hill cu...
3    lovely spring looked windsor terrace sixteen f...
4    finding nothing else even gold superintendent ...
Name: text, dtype: object

And here is the one hot encoded label (multi-class classification where the number of classes = 3)

[[1 0 0]
 [0 1 0]
 [1 0 0]
 ...
 [1 0 0]
 [1 0 0]
 [0 1 0]]

Here is what I think happens step by step, please correct me if I'm wrong:

  1. Converting my input text data['text'] to a bag of indices (sequences)

    vocabulary_size = 20000
    
    tokenizer = Tokenizer(num_words = vocabulary_size)
    tokenizer.fit_on_texts(data['text'])
    sequences = tokenizer.texts_to_sequences(data['text'])
    
    data = pad_sequences(sequences, maxlen=50)
    

What is happening is that data['text'], which has shape (19579,), is being converted into an array of indices of shape (19579, 50), where each word is replaced by its index from tokenizer.word_index.items().
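For intuition, here is a minimal sketch of what Tokenizer and pad_sequences do, using a toy corpus rather than the actual dataset (the exact indices depend on word frequencies in the fitted texts):

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    
    # Toy corpus, purely for illustration
    toy_texts = ["gold snuff box", "never occurred mere mistake"]
    
    toy_tokenizer = Tokenizer(num_words=20000)
    toy_tokenizer.fit_on_texts(toy_texts)
    print(toy_tokenizer.word_index)
    # e.g. {'gold': 1, 'snuff': 2, 'box': 3, 'never': 4, 'occurred': 5, 'mere': 6, 'mistake': 7}
    
    toy_sequences = toy_tokenizer.texts_to_sequences(toy_texts)
    print(toy_sequences)                          # [[1, 2, 3], [4, 5, 6, 7]]
    
    # pad_sequences left-pads with zeros so every row has length maxlen
    print(pad_sequences(toy_sequences, maxlen=5))
    # [[0 0 1 2 3]
    #  [0 4 5 6 7]]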

  2. Loading the glove 100d word vector

    embeddings_index = dict()
    f = open('/Users/abhishekbabuji/Downloads/glove.6B/glove.6B.100d.txt')
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()
    
    print(embeddings_index)
        {'the': array([-0.038194, -0.24487 ,  0.72812 , -0.39961 ,  0.083172,  0.043953,
        -0.39141 ,  0.3344  , -0.57545 ,  0.087459,  0.28787 , -0.06731 ,
         0.30906 , -0.26384 , -0.13231 , -0.20757 ,  0.33395 , -0.33848 ,
        -0.31743 , -0.48336 ,  0.1464  , -0.37304 ,  0.34577 ,  0.052041,
         0.44946 , -0.46971 ,  0.02628 , -0.54155 , -0.15518 , -0.14107 ,
        -0.039722,  0.28277 ,  0.14393 ,  0.23464 , -0.31021 ,  0.086173,
         0.20397 ,  0.52624 ,  0.17164 , -0.082378, -0.71787 , -0.41531 ,
         0.20335 , -0.12763 ,  0.41367 ,  0.55187 ,  0.57908 , -0.33477 ,
        -0.36559 , -0.54857 , -0.062892,  0.26584 ,  0.30205 ,  0.99775 ,
        -0.80481 , -3.0243  ,  0.01254 , -0.36942 ,  2.2167  ,  0.72201 ,
        -0.24978 ,  0.92136 ,  0.034514,  0.46745 ,  1.1079  , -0.19358 ,
        -0.074575,  0.23353 , -0.052062, -0.22044 ,  0.057162, -0.15806 ,
        -0.30798 , -0.41625 ,  0.37972 ,  0.15006 , -0.53212 , -0.2055  ,
        -1.2526  ,  0.071624,  0.70565 ,  0.49744 , -0.42063 ,  0.26148 ,
        -1.538   , -0.30223 , -0.073438, -0.28312 ,  0.37104 , -0.25217 ,
         0.016215, -0.017099, -0.38984 ,  0.87424 , -0.72569 , -0.51058 ,
        -0.52028 , -0.1459  ,  0.8278  ,  0.27062 ], dtype=float32),
    

So what we have now is a 100-dimensional word vector for every word in the GloVe vocabulary.
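A quick sanity check (assuming the file loaded as above) confirms that each entry is a 100-dimensional vector:

    print(len(embeddings_index))          # about 400,000 entries in glove.6B
    print(embeddings_index['the'].shape)  # (100,)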

  3. Creating the embedding matrix using the glove word vector

    vocabulary_size = 20000
    embedding_matrix = np.zeros((vocabulary_size, 100))
    
    for word, index in tokenizer.word_index.items():
        if index > vocabulary_size - 1:
            break
        else:
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[index] = embedding_vector
    

So we now have a 100-dimensional vector for EACH of the 20000 words, stored row by row in embedding_matrix.
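To see what the loop produced, you can compare a row of embedding_matrix against the raw GloVe lookup ('the' is just an example word; any word missing from GloVe keeps its all-zero row):

    import numpy as np
    
    idx = tokenizer.word_index['the']   # index assigned by the Tokenizer
    print(np.allclose(embedding_matrix[idx], embeddings_index['the']))   # True
    
    # Row 0 is never filled (Tokenizer indices start at 1), so it stays all zeros,
    # as does the row of any word that has no GloVe vector
    print(embedding_matrix[0].sum())    # 0.0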

And here is the architecture:

model_glove = Sequential()
model_glove.add(Embedding(vocabulary_size, 100, input_length=50, weights=[embedding_matrix], trainable=False))
model_glove.add(Dropout(0.5))
model_glove.add(Conv1D(64, 5, activation='relu')) 
model_glove.add(MaxPooling1D(pool_size=4))
model_glove.add(LSTM(100))
model_glove.add(Dense(3, activation='softmax'))
model_glove.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model_glove.summary())

I get

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_7 (Embedding)      (None, 50, 100)           2000000   
_________________________________________________________________
dropout_7 (Dropout)          (None, 50, 100)           0         
_________________________________________________________________
conv1d_7 (Conv1D)            (None, 46, 64)            32064     
_________________________________________________________________
max_pooling1d_7 (MaxPooling1 (None, 11, 64)            0         
_________________________________________________________________
lstm_7 (LSTM)                (None, 100)               66000     
_________________________________________________________________
dense_7 (Dense)              (None, 3)                 303       
=================================================================
Total params: 2,098,367
Trainable params: 98,367
Non-trainable params: 2,000,000
_________________________________________________________________

The input to the above architecture will be the training data

array([[    0,     0,     0, ...,  4867,    22,   340],
       [    0,     0,     0, ...,    12,   327,  2301],
       [    0,     0,     0, ...,   255,   388,  2640],
       ...,
       [    0,     0,     0, ...,    17, 15609, 15242],
       [    0,     0,     0, ...,  9517,  9266,   442],
       [    0,     0,     0, ...,  3399,   379,  5927]], dtype=int32)

of shape (19579, 50)

and the labels as one-hot encodings.

My trouble is understanding what exactly is happening to my (19579, 50) input as it goes through each of the following lines:

model_glove = Sequential()
model_glove.add(Embedding(vocabulary_size, 100, input_length=50, weights=[embedding_matrix], trainable=False))
model_glove.add(Dropout(0.5))
model_glove.add(Conv1D(64, 5, activation='relu')) 
model_glove.add(MaxPooling1D(pool_size=4))

I understand why we need model_glove.add(Dropout(0.5)): it shuts down some hidden units with a probability of 0.5 so the model doesn't become overly complex. But I have no idea why we need the Conv1D(64, 5, activation='relu') and the MaxPooling1D(pool_size=4), or how their output goes into my model_glove.add(LSTM(100)) unit.

Upvotes: 4

Views: 2910

Answers (1)

Karl

Reputation: 5822

The simplest way to understand a convolution is to think of it as a mapping that tells a neural network which features are nearby: pixels in the case of image recognition, where you would use a 2D convolution, or the words before and after a given word for text, where you would use a 1D convolution. Without this, the network has no way of knowing that words just before or just after a given word are more relevant than words that are much further away. A convolution also presents the information in a much more densely packed format, thereby greatly reducing the number of parameters (in your case down from 2 million to about 32 thousand). I find that this answer explains the technicality of how it works rather well: https://stackoverflow.com/a/52353721/141789
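For your particular Conv1D(64, 5) layer, both the parameter count and the output shape in the summary can be reproduced by hand (a quick sketch of the arithmetic, assuming the Keras default padding='valid'):

    kernel_size, filters, embed_dim, seq_len = 5, 64, 100, 50
    
    # Each filter spans 5 timesteps x 100 embedding dimensions, plus one bias per filter
    params = filters * (kernel_size * embed_dim) + filters
    print(params)                        # 32064, as in the summary
    
    # 'valid' padding: the 5-wide window fits 50 - 5 + 1 = 46 times along the sequence
    out_len = seq_len - kernel_size + 1
    print(out_len)                       # 46 -> output shape (None, 46, 64)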

Max pooling is a method that downsamples your data. It is often used directly after convolutions and achieves two things:

  1. It again reduces the number of parameters. In your case, it will represent four values with a single value (the max of the four values). It does this by taking the first four values, then taking a "stride" of size four and taking the next four values, and so on. In other words, there will be no overlap between the pools. (This is what Keras does by default, but you could also set the stride to 2, for example.)
  2. Secondly, because it keeps only the max value, in theory it "sharpens" the contrast between the pools, compared with, for example, taking the average.

Max pooling is not "learnt"; it is just a simple arithmetic calculation. That is why the number of parameters is given as zero. The same goes for dropout.
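To make the downsampling concrete, here is a small sketch of what the pooling step does to your shapes (assuming the Keras default stride equal to pool_size):

    import numpy as np
    
    x = np.array([3., 1., 4., 1., 5., 9., 2., 6.])   # toy sequence of 8 values
    pool_size = 4
    
    # Non-overlapping windows of 4; keep only the max of each window
    pooled = x.reshape(-1, pool_size).max(axis=1)
    print(pooled)                                     # [4. 9.]
    
    # In your model: 46 timesteps // 4 = 11 pooled timesteps -> (None, 11, 64)
    print(46 // pool_size)                            # 11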

An LSTM expects a three dimensional input of shape (number of samples, number of timesteps, number of features). Having performed the previous convolution and max pooling steps, you've reduced the representation of your initial embedding to number of timesteps = 11 and number of features = 64. The first value, number of samples = None, is a placeholder for the batch size you plan to use. By initializing an LSTM with 100 units (also known as hidden states) you are parameterizing the size of the "memory" of the LSTM: essentially the accumulation of its input, output and forget gates through time.
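The 66,000 LSTM parameters in the summary also follow directly from those numbers (a back-of-the-envelope check using the standard LSTM parameter formula; the variable names below are just for illustration):

    units, input_dim = 100, 64   # 100 hidden states; 64 features per timestep after pooling
    
    # Four gates (input, forget, cell, output), each with an input kernel,
    # a recurrent kernel and a bias
    params = 4 * (units * input_dim + units * units + units)
    print(params)                # 66000, matching the model summary
    
    # The LSTM reads the (None, 11, 64) tensor one timestep at a time and
    # returns only its final hidden state, giving the (None, 100) output shape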

Upvotes: 4
