Reputation: 1731
Here is my input data:
data['text'].head()
0 process however afforded means ascertaining di...
1 never occurred fumbling might mere mistake
2 left hand gold snuff box which capered hill cu...
3 lovely spring looked windsor terrace sixteen f...
4 finding nothing else even gold superintendent ...
Name: text, dtype: object
And here are the one-hot encoded labels (multi-class classification where the number of classes = 3):
[[1 0 0]
[0 1 0]
[1 0 0]
...
[1 0 0]
[1 0 0]
[0 1 0]]
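For reference, a label array like this can be built from integer class ids with keras.utils.to_categorical; the snippet below is only an illustration, not how my actual labels were produced:
import numpy as np
from keras.utils import to_categorical

class_ids = np.array([0, 1, 0, 2])               # made-up integer labels, not my data
one_hot = to_categorical(class_ids, num_classes=3)
# one_hot -> [[1 0 0], [0 1 0], [1 0 0], [0 0 1]]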
Here is what I think happens step by step, please correct me if I'm wrong:
Converting my input text data['text'] into sequences of word indices
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

vocabulary_size = 20000
tokenizer = Tokenizer(num_words=vocabulary_size)        # keep only the 20,000 most frequent words
tokenizer.fit_on_texts(data['text'])
sequences = tokenizer.texts_to_sequences(data['text'])  # each text becomes a list of word indices
data = pad_sequences(sequences, maxlen=50)              # pad/truncate every sequence to length 50
What is happening is that my data['text'], which has shape (19579,), is being converted into an array of indices of shape (19579, 50), where each word is replaced by its index from tokenizer.word_index.
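To make that concrete, here is a tiny illustration with made-up sentences (not my data):
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

toy = ['gold snuff box', 'gold box']           # hypothetical mini-corpus
tok = Tokenizer(num_words=20000)
tok.fit_on_texts(toy)
print(tok.word_index)                          # e.g. {'gold': 1, 'box': 2, 'snuff': 3}
seqs = tok.texts_to_sequences(toy)             # [[1, 3, 2], [1, 2]]
print(pad_sequences(seqs, maxlen=5))           # left-padded with zeros to length 5
# [[0 0 1 3 2]
#  [0 0 0 1 2]]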
Loading the GloVe 100d word vectors
import numpy as np

# Build a word -> 100-dimensional vector lookup from the GloVe file
embeddings_index = dict()
f = open('/Users/abhishekbabuji/Downloads/glove.6B/glove.6B.100d.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print(embeddings_index)
{'the': array([-0.038194, -0.24487 , 0.72812 , -0.39961 , 0.083172, 0.043953,
-0.39141 , 0.3344 , -0.57545 , 0.087459, 0.28787 , -0.06731 ,
0.30906 , -0.26384 , -0.13231 , -0.20757 , 0.33395 , -0.33848 ,
-0.31743 , -0.48336 , 0.1464 , -0.37304 , 0.34577 , 0.052041,
0.44946 , -0.46971 , 0.02628 , -0.54155 , -0.15518 , -0.14107 ,
-0.039722, 0.28277 , 0.14393 , 0.23464 , -0.31021 , 0.086173,
0.20397 , 0.52624 , 0.17164 , -0.082378, -0.71787 , -0.41531 ,
0.20335 , -0.12763 , 0.41367 , 0.55187 , 0.57908 , -0.33477 ,
-0.36559 , -0.54857 , -0.062892, 0.26584 , 0.30205 , 0.99775 ,
-0.80481 , -3.0243 , 0.01254 , -0.36942 , 2.2167 , 0.72201 ,
-0.24978 , 0.92136 , 0.034514, 0.46745 , 1.1079 , -0.19358 ,
-0.074575, 0.23353 , -0.052062, -0.22044 , 0.057162, -0.15806 ,
-0.30798 , -0.41625 , 0.37972 , 0.15006 , -0.53212 , -0.2055 ,
-1.2526 , 0.071624, 0.70565 , 0.49744 , -0.42063 , 0.26148 ,
-1.538 , -0.30223 , -0.073438, -0.28312 , 0.37104 , -0.25217 ,
0.016215, -0.017099, -0.38984 , 0.87424 , -0.72569 , -0.51058 ,
-0.52028 , -0.1459 , 0.8278 , 0.27062 ], dtype=float32),
So what we have now is a 100-dimensional word vector for every word in the GloVe vocabulary.
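As a quick sanity check on that dictionary (just an illustration):
vec = embeddings_index.get('the')
print(vec.shape)               # (100,)
print(len(embeddings_index))   # should be about 400,000 entries for glove.6B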
Creating the embedding matrix using the GloVe word vectors
# Map each of the top-20,000 tokenizer indices to its GloVe vector
# (rows stay all-zero for words that are not in GloVe)
vocabulary_size = 20000
embedding_matrix = np.zeros((vocabulary_size, 100))
for word, index in tokenizer.word_index.items():
    if index > vocabulary_size - 1:
        break
    else:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[index] = embedding_vector
So we now have a 100-dimensional vector for EACH of the 20,000 words.
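A small check on that matrix (row 0 and any word not found in GloVe simply stay all zeros, since the matrix was initialised with np.zeros):
print(embedding_matrix.shape)                          # (20000, 100)
n_zero_rows = int(np.sum(~embedding_matrix.any(axis=1)))
print(n_zero_rows)                                     # padding index 0 plus words missing from GloVe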
And here is the architecture:
from keras.models import Sequential
from keras.layers import Embedding, Dropout, Conv1D, MaxPooling1D, LSTM, Dense

model_glove = Sequential()
model_glove.add(Embedding(vocabulary_size, 100, input_length=50, weights=[embedding_matrix], trainable=False))
model_glove.add(Dropout(0.5))
model_glove.add(Conv1D(64, 5, activation='relu'))
model_glove.add(MaxPooling1D(pool_size=4))
model_glove.add(LSTM(100))
model_glove.add(Dense(3, activation='softmax'))
model_glove.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model_glove.summary())
I get
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_7 (Embedding) (None, 50, 100) 2000000
_________________________________________________________________
dropout_7 (Dropout) (None, 50, 100) 0
_________________________________________________________________
conv1d_7 (Conv1D) (None, 46, 64) 32064
_________________________________________________________________
max_pooling1d_7 (MaxPooling1 (None, 11, 64) 0
_________________________________________________________________
lstm_7 (LSTM) (None, 100) 66000
_________________________________________________________________
dense_7 (Dense) (None, 3) 303
=================================================================
Total params: 2,098,367
Trainable params: 98,367
Non-trainable params: 2,000,000
_________________________________________________________________
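Those parameter counts can be reproduced by hand; here is a sketch of the arithmetic, just to check the summary:
embedding_params = 20000 * 100                  # 2,000,000 frozen GloVe weights
conv_params = 64 * (5 * 100) + 64               # 32,064: 64 filters of width 5 over 100 channels, plus biases
lstm_params = 4 * (100 * (64 + 100) + 100)      # 66,000: four gates, each with input weights, recurrent weights and a bias
dense_params = 100 * 3 + 3                      # 303
print(embedding_params + conv_params + lstm_params + dense_params)   # 2,098,367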
The input to the above architecture will be the training data
array([[ 0, 0, 0, ..., 4867, 22, 340],
[ 0, 0, 0, ..., 12, 327, 2301],
[ 0, 0, 0, ..., 255, 388, 2640],
...,
[ 0, 0, 0, ..., 17, 15609, 15242],
[ 0, 0, 0, ..., 9517, 9266, 442],
[ 0, 0, 0, ..., 3399, 379, 5927]], dtype=int32)
of shape (19579, 50)
and labels as one-hot encodings.
My trouble is understanding what exactly is happening to my (19579, 50) array as it goes through each of the following lines:
model_glove = Sequential()
model_glove.add(Embedding(vocabulary_size, 100, input_length=50, weights=[embedding_matrix], trainable=False))
model_glove.add(Dropout(0.5))
model_glove.add(Conv1D(64, 5, activation='relu'))
model_glove.add(MaxPooling1D(pool_size=4))
I understand why we need model_glove.add(Dropout(0.5)): it shuts down some hidden units with a probability of 0.5 to keep the model from becoming overly complex. But I have no idea why we need the Conv1D(64, 5, activation='relu'), the MaxPooling1D(pool_size=4), or how this feeds into my model_glove.add(LSTM(100)) unit.
Upvotes: 4
Views: 2910
Reputation: 5822
The simplest way to understand a convolution is to think of it as a mapping that tells a neural network which features are nearby (pixels in the case of image recognition, where you would use a 2D convolution; words before or after a given word in the case of text, where you would use a 1D convolution). Without this, the network has no way of knowing that words just before or just after a given word are more relevant than words that are much further away. It typically also results in information being presented in a much more densely packed format, greatly reducing the number of parameters (in your case down from 2 million to 30 thousand). I find that this answer explains the technicalities of how it works rather well: https://stackoverflow.com/a/52353721/141789
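If it helps, here is the convolution layer in isolation, with the same settings as your model, so you can see the shape change on its own (a sketch, nothing more):
from keras.models import Sequential
from keras.layers import Conv1D

m = Sequential()
m.add(Conv1D(64, 5, activation='relu', input_shape=(50, 100)))
m.summary()
# Each of the 64 filters slides over windows of 5 consecutive word vectors (5 x 100 weights + 1 bias),
# so 50 timesteps shrink to 50 - 5 + 1 = 46 and the output shape is (None, 46, 64).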
Max pooling is a method that downsamples your data. It is often used directly after convolutions and achieves two things:
1. It further reduces the amount of data flowing through the network (by a factor of 2, for example; with your pool_size=4, the 46 timesteps become 11).
2. By taking the max value of each pool, in theory it "sharpens" the contrast between the pools (keeping the maximum value instead of, for example, the average).
Max pooling is not "learnt"; it is just a simple arithmetic calculation. That is why the number of parameters is given as zero. The same goes for dropout.
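A minimal numerical illustration of pooling with pool_size=4 (made-up numbers, one feature channel):
import numpy as np

x = np.array([1, 7, 3, 2, 5, 0, 4, 1, 9, 2, 2, 2])   # 12 timesteps of a single feature
pools = x.reshape(-1, 4)                              # non-overlapping pools of 4
print(pools.max(axis=1))                              # [7 5 9] -- 12 timesteps become 3
# In your model the same happens per feature channel: 46 timesteps // 4 -> 11.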
An LSTM
expects a three dimensional input of shape (number of samples, number of timesteps, number of features)
. Having performed the previous convolution and max pooling steps, you've reduced the representation of your initial embedding to number of timesteps = 11
and number of features = 64
. The first value number of samples = None
is a placeholder for the batch size
you plan to use. By initializing an LSTM with 100 units
(also known as hidden states
) you are parameterizing the size of the "memory" of the LSTM: essentially the accumulation of its input, output and forget gates through time.
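To make that concrete, here is the LSTM on its own, fed with the shape it receives from the max pooling layer (again just a sketch):
from keras.models import Sequential
from keras.layers import LSTM

m = Sequential()
m.add(LSTM(100, input_shape=(11, 64)))   # (timesteps=11, features=64); batch size stays None
m.summary()
# Output shape (None, 100): the LSTM reads the 11 timesteps one after another and returns its final
# 100-dimensional hidden state. Parameters: 4 * (100 * (64 + 100) + 100) = 66,000, matching your summary.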
Upvotes: 4