Reputation: 1072
I am reading through the TensorFlow tutorials on neural networks and I came across the architecture part, which is a bit confusing. Can someone explain why the following settings were used in this code?
# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
model.summary()
Why vocab_size? Why a value of 16 for the Embedding? And the choice of units: I get the intuition behind the last Dense layer because it is binary classification (1 unit), but why 16 units in the Dense layer before it? Are the 16 in the Embedding and the 16 units in the first Dense layer related? Do they have to be equal?
Could someone explain this paragraph too:
The first layer is an Embedding layer. This layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: (batch, sequence, embedding).
source: Classify movie reviews: binary classification
Upvotes: 0
Views: 873
Reputation: 16587
The input to the model is a batch of integer-encoded reviews of shape [Batch_size * seq_len].
The Embedding layer looks up a 16-dimensional vector for each word index, so its output has shape [Batch_size * seq_len * embedding_dim].
GlobalAveragePooling1D averages over the sequence axis, turning [Batch_size * seq_len * embedding_dim] into [Batch_size * embedding_dim].
The Dense layer then maps [Batch_size * embedding_dim] into [Batch_size * unit], where unit is the number of units you chose (16 here).
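A quick way to see these shapes for yourself (a minimal sketch, assuming the same model as in the question and an arbitrary padded length of 256) is to push a fake batch through the layers and print the result:

import numpy as np
import tensorflow as tf
from tensorflow import keras

vocab_size = 10000
seq_len = 256  # assumed padded review length; any fixed length works for this check

model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 16),           # [batch, seq_len] -> [batch, seq_len, 16]
    keras.layers.GlobalAveragePooling1D(),            # -> [batch, 16]
    keras.layers.Dense(16, activation=tf.nn.relu),    # -> [batch, 16]
    keras.layers.Dense(1, activation=tf.nn.sigmoid),  # -> [batch, 1]
])

# Fake batch of 32 integer-encoded reviews, just to inspect the output shape.
x = np.random.randint(0, vocab_size, size=(32, seq_len))
print(model(x).shape)  # (32, 1): one sigmoid probability per review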
Upvotes: 1
Reputation: 534
The first layer's input dimension is vocab_size because each word is represented as an index into the vocabulary. For example, if the input word is 'word', which is the 500th word in the vocabulary, the input is a vector of length vocab_size with all zeros except a one at index 500. This is commonly referred to as a 'one-hot' representation.
The Embedding layer essentially takes this huge input vector and condenses it into a much smaller vector (in this case, of length 16) that encodes some of the information about the word. The specific embedding weights are learned during training just like those of any other neural network layer. I'd recommend reading up on word embeddings. The length of 16 is somewhat arbitrary here and can be tuned. One could do away with the embedding layer, but then the model would have less expressive power (it would just be logistic regression, which is a linear model).
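As a rough sketch of that lookup (made-up word indices, not the tutorial's code), you can call an Embedding layer directly on a few indices and see that each one becomes a 16-dimensional vector:

import numpy as np
from tensorflow import keras

vocab_size = 10000
embedding = keras.layers.Embedding(vocab_size, 16)

# Three made-up word indices; 500 would be the 'word' example above.
indices = np.array([[500, 3, 42]])
vectors = embedding(indices)
print(vectors.shape)  # (1, 3, 16): one 16-dimensional vector per word index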
Then, as you said, the last layer simply predicts the class of the review based on the embedding.
Upvotes: 1