Reputation: 459
I am new to Keras and have read blog posts about deep-learning classification using Keras, but even after reading a lot of them I am unable to figure out how each of them calculated the value of the first Dense layer (the one just after the Flatten layer) in their code. For example:
from keras.models import Sequential
from keras.layers import Dense, Conv2D, MaxPooling2D, Dropout, Flatten
def createModel():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=input_shape))
    model.add(Conv2D(32, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(nClasses, activation='softmax'))
    return model
My doubts:
If I put too large a value, like in my code below (going by that logic, I multiplied my flattened size of 86400 by 2, i.e. 172800), I get the following error:
model = Sequential()
model.add(Conv2D(32, (3, 3),input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3) ))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(96, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(172800))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(4))
model.add(Activation('softmax'))
model.summary()
ValueError: rng_mrg cpu-implementation does not support more than (2**31 -1) samples
HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'. HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
This is the summary of my model without the first dense layer:
Layer (type) Output Shape Param #
=================================================================
conv2d_4 (Conv2D) (None, 254, 254, 32) 896
_________________________________________________________________
activation_4 (Activation) (None, 254, 254, 32) 0
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 127, 127, 32) 0
_________________________________________________________________
dropout_4 (Dropout) (None, 127, 127, 32) 0
_________________________________________________________________
conv2d_5 (Conv2D) (None, 125, 125, 64) 18496
_________________________________________________________________
activation_5 (Activation) (None, 125, 125, 64) 0
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 62, 62, 64) 0
_________________________________________________________________
dropout_5 (Dropout) (None, 62, 62, 64) 0
_________________________________________________________________
conv2d_6 (Conv2D) (None, 60, 60, 96) 55392
_________________________________________________________________
activation_6 (Activation) (None, 60, 60, 96) 0
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 30, 30, 96) 0
_________________________________________________________________
dropout_6 (Dropout) (None, 30, 30, 96) 0
_________________________________________________________________
flatten_2 (Flatten) (None, 86400) 0
_________________________________________________________________
activation_7 (Activation) (None, 86400) 0
_________________________________________________________________
dropout_7 (Dropout) (None, 86400) 0
_________________________________________________________________
dense_2 (Dense) (None, 4) 345604
_________________________________________________________________
activation_8 (Activation) (None, 4) 0
Total params: 420,388
Trainable params: 420,388
Non-trainable params: 0
When I eliminate this layer altogether, or even when I put in a smaller value, my code works. But I don't want to blindly set parameters without knowing the reason.
Upvotes: 1
Views: 2846
Reputation: 2533
Many design decisions in Deep Learning come down to pragmatic rules that seem to work fairly well after trying different options.
The size of the second-to-last Dense layer is one of those examples. By giving a network more depth (more layers) and/or making it wider (more channels), we increase the theoretical learning capacity of the model. However, simply giving a network 10000 Dense layers with 172800 channels will likely not improve performance or even work at all.
In theory, 512 is completely arbitrary. In practice, it's within the range of sizes I have seen in other architectures. I understand your decision to connect the number of output units to the number of input units with a ratio of 2. While it's entirely possible that this is the greatest idea in Deep Learning anyone has ever come up with, I commonly see examples where the size of the second-to-last Dense layer is tied to the number of output classes in the final layer.
So as a rule of thumb, you could play with ratios of 2x to 4x the number of classes and see where that gets you. The layer you tried to create would have had roughly 15 billion parameters. That alone is about 100 times larger than the biggest architectures I have seen.
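To make that number concrete, here is a quick back-of-the-envelope sketch (plain Python; the 86400 and the 4 classes are taken from your model summary, the other widths are just candidates): a Dense layer that follows a Flatten layer has flat_dim * units + units parameters (weight matrix plus biases).

# parameter count of a Dense layer placed right after Flatten:
# params = flat_dim * units + units   (weights plus biases)
flat_dim = 86400   # flattened size from your model.summary()
n_classes = 4      # size of your final softmax layer

for units in (2 * n_classes, 4 * n_classes, 512, 172800):
    params = flat_dim * units + units
    print("Dense({}): {:,} parameters".format(units, params))

# prints:
# Dense(8): 691,208 parameters
# Dense(16): 1,382,416 parameters
# Dense(512): 44,237,312 parameters
# Dense(172800): 14,930,092,800 parameters

The last line is also, as far as I can tell, why Theano's MRG random generator complains: initializing a weight matrix with about 1.5 * 10**10 entries needs more random samples than the 2**31 - 1 it supports.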
At this point I would like to stop guessing further recommendations, because it is dependent on so many factors.
Upvotes: 2