shammery

Reputation: 1072

How to setup a neural network architecture for binary classification

I am reading through the TensorFlow tutorials on neural networks and I came across the architecture part, which is a bit confusing. Can someone explain why the following settings were used in this code?

import tensorflow as tf
from tensorflow import keras

# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

model.summary()

vocab_size? The value of 16 for Embedding? And the choice of units? I get the intuition behind the last Dense layer because it is binary classification (1 unit), but why 16 units in the second layer? Are the 16 in the Embedding layer and the 16 units in the first Dense layer related? Do they have to be equal?

Could someone explain this paragraph too:

The first layer is an Embedding layer. This layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: (batch, sequence, embedding).

source: Classify movie reviews: binary classification

Upvotes: 0

Views: 873

Answers (2)

Amir

Reputation: 16587

  • vocab_size: All words in your corpus (in this case IMDB) are sorted by frequency and the top 10,000 words are kept; the rest of the vocabulary is ignored. E.g. "This is really Fancyyyyyyy" converts into ==> [8 7 9]. As you may guess, the word Fancyyyyyyy is dropped because it is not among the top 10,000 words.
  • pad_sequences: Converts all sentences to the same length. The documents in the training corpus have different lengths, so all of them are padded or truncated to seq_len = 256. After this step, your output is [batch_size * seq_len].
  • Embedding: Each word is converted to a vector with 16 dimensions. As a result, the output of this step is a tensor of size [batch_size * seq_len * embedding_dim].
  • GlobalAveragePooling1D: Converts your sequence of size [batch_size * seq_len * embedding_dim] into [batch_size * embedding_dim] by averaging over the sequence dimension.
  • unit: is the output size of the Dense (MLP) layer. It converts [batch_size * embedding_dim] into [batch_size * unit]. (See the shape sketch after this list.)
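
Here is a minimal sketch of that shape flow, assuming TensorFlow 2.x; batch_size = 32 and seq_len = 256 are just illustrative values, and the random integers stand in for already-padded, integer-encoded reviews:

import numpy as np
import tensorflow as tf
from tensorflow import keras

vocab_size, seq_len, embedding_dim, batch_size = 10000, 256, 16, 32

# Fake integer-encoded, already-padded reviews: shape [batch_size, seq_len]
padded = np.random.randint(0, vocab_size, size=(batch_size, seq_len))

x = keras.layers.Embedding(vocab_size, embedding_dim)(padded)  # -> (32, 256, 16)
x = keras.layers.GlobalAveragePooling1D()(x)                   # -> (32, 16)
x = keras.layers.Dense(16, activation=tf.nn.relu)(x)           # -> (32, 16)
x = keras.layers.Dense(1, activation=tf.nn.sigmoid)(x)         # -> (32, 1)
print(x.shape)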

Upvotes: 1

twhughes

Reputation: 534

The first layer is vocab_size because each word is represented as an index into the vocabulary. For example, if the input word is 'word', which is the 500th word in the vocabulary, the input is a vector of length vocab_size with all zeros except a one at index 500. This is commonly referred to as a 'one hot' representation.

The embedding layer essentially takes this huge input vector and condenses it into a smaller vector (in this case, length 16) that encodes some of the information about the word. The specific embedding weights are learned during training just like any other neural network layer's. I'd recommend reading up on word embeddings. The length of 16 is somewhat arbitrary here and can be tuned. One could do away with this embedding layer, but then the model would have less expressive power (it would essentially be logistic regression, which is a linear model).
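
A minimal sketch of that one-hot view, assuming TensorFlow 2.x (word_index = 500 is just an illustrative value): looking up a word index in the Embedding layer returns the same 16-dimensional vector as multiplying the one-hot representation by the embedding weight matrix.

import numpy as np
from tensorflow import keras

vocab_size, embedding_dim = 10000, 16
embedding = keras.layers.Embedding(vocab_size, embedding_dim)

word_index = 500                                         # e.g. the 500th word in the vocabulary
lookup = embedding(np.array([word_index])).numpy()[0]    # direct index lookup -> 16-dim vector

one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[word_index] = 1.0
matmul = one_hot @ embedding.get_weights()[0]            # one-hot times the (10000, 16) weight matrix

print(np.allclose(lookup, matmul))                       # True: both give the same embedding vector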

Then, as you said, the last layer simply predicts the class of the review based on the pooled embedding.

Upvotes: 1
