John321

Reputation: 45

Feed-Forward Neural Networks in Keras

For the input to the feed-forward neural network that I have implemented in Keras, I just wanted to check that my understanding is correct.

[[ 25.26000023  26.37000084  24.67000008  23.30999947]
[ 26.37000084  24.67000008  23.30999947  21.36000061]
[ 24.67000008  23.30999947  21.36000061  19.77000046]...]

So in the data above, each row is a time window of 4 inputs. My input layer is

model.add(Dense(4, input_dim=4, activation='sigmoid')) 

model.fit(trainX, trainY, nb_epoch=10000,verbose=2,batch_size=4)

and batch_size is 4. In theory, when I call the fit function, will it go over all of these inputs in each nb_epoch? And does the batch_size need to be 4 in order for this time window to work?
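For context, here is a minimal sketch of how such a window can be built (the series values are placeholders, not the real data, and the one-step-ahead target is just an assumption):

import numpy as np

# Illustrative only: build a sliding window of 4 inputs from a 1-D series.
series = np.array([25.26, 26.37, 24.67, 23.31, 21.36, 19.77, 18.50, 17.20])
window = 4

trainX = np.array([series[i:i + window] for i in range(len(series) - window)])
trainY = series[window:]  # the value that follows each window

print(trainX.shape)  # each row has 4 features, matching input_dim=4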

Thanks John

Upvotes: 1

Views: 844

Answers (2)

lejlot

Reputation: 66805

and batch_size is 4. In theory, when I call the fit function, will it go over all of these inputs in each nb_epoch?

Yes, each epoch is one iteration over all training samples.

And does the batch_size need to be 4 in order for this time window to work?

No, these are completely unrelated things. A batch is simply a subset of your training data which is used to compute an approximation of the true gradient of the cost function. The bigger the batch, the closer you get to the true gradient (and to original gradient descent), but the slower training becomes. The closer you get to a batch of 1, the more stochastic and noisy the approximation becomes (and the closer you get to stochastic gradient descent). The fact that you matched batch_size and the data dimensionality is just an odd coincidence and has no meaning.
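A minimal sketch of this independence, assuming the Keras 1.x API used in the question (nb_epoch; in Keras 2 the argument is epochs) and made-up data:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Made-up data: 100 samples with 4 features each (placeholders, not the real series).
trainX = np.random.rand(100, 4)
trainY = np.random.rand(100, 1)

model = Sequential()
model.add(Dense(4, input_dim=4, activation='sigmoid'))
model.add(Dense(1))
model.compile(loss='mse', optimizer='sgd')

# Any batch_size works with input_dim=4; only the gradient-noise/speed trade-off changes.
model.fit(trainX, trainY, nb_epoch=5, batch_size=1, verbose=0)    # close to SGD
model.fit(trainX, trainY, nb_epoch=5, batch_size=10, verbose=0)   # mini-batch
model.fit(trainX, trainY, nb_epoch=5, batch_size=100, verbose=0)  # full batch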

To put this in a more general setting: what you do in gradient descent with an additive loss function (which neural nets usually use) is move against the gradient, which is

grad_theta 1/N SUM_{i=1}^N loss(x_i, pred(x_i), y_i | theta)
    = 1/N SUM_{i=1}^N grad_theta loss(x_i, pred(x_i), y_i | theta)

where loss is some loss function comparing your prediction pred(x_i) to y_i.

In the batch-based scenario the rough idea is that you do not need to go over all examples, but instead over some strict subset, like batch = {(x_1, y_1), (x_5, y_5), (x_89, y_89), ...}, and use an approximation of the gradient of the form

1/|batch| SUM_{(x_i, y_i) in batch} grad_theta loss(x_i, pred(x_i), y_i | theta)

As you can see, this is not related in any way to the space in which the x_i live, so there is no connection to the dimensionality of your data.
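To make the approximation concrete, here is a small sketch (with a made-up linear model and squared loss, not anything from the question) comparing the full gradient with a mini-batch estimate:

import numpy as np

np.random.seed(0)
N, d = 1000, 4                      # d is the data dimensionality; unrelated to batch size
X = np.random.randn(N, d)
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * np.random.randn(N)
theta = np.zeros(d)

def grad(Xb, yb, theta):
    # grad_theta 1/|batch| SUM (x_i . theta - y_i)^2
    return 2.0 / len(Xb) * Xb.T @ (Xb @ theta - yb)

full_grad = grad(X, y, theta)                       # uses all N samples
idx = np.random.choice(N, size=32, replace=False)   # one batch of 32 samples
batch_grad = grad(X[idx], y[idx], theta)            # noisy approximation

print(np.linalg.norm(full_grad - batch_grad))       # small, and shrinks as the batch grows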

Upvotes: 1

7VoltCrayon

Reputation: 662

Let me explain this with an example:

When you have 32 training examples and you call model.fit with a batch_size of 4, the neural network will be presented with 4 examples at a time, but one epoch will still be defined as one complete pass over all 32 examples. So in this case the network will go through 4 examples at a time and will, at least in theory, run the forward pass (and the backward pass) 32 / 4 = 8 times.

In the extreme case where your batch_size is 1, that is plain old stochastic gradient descent. When your batch_size is greater than 1 but smaller than the full training set, it is usually called mini-batch gradient descent (using the whole set at once is batch gradient descent).
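A tiny sketch of that bookkeeping (a pseudo training loop to show the counting, not Keras internals):

n_examples, batch_size = 32, 4
indices = list(range(n_examples))

updates = 0
for start in range(0, n_examples, batch_size):
    batch = indices[start:start + batch_size]  # 4 examples at a time
    # forward pass + backward pass + weight update would happen here
    updates += 1

print(updates)  # 32 / 4 = 8 updates in one epoch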

Upvotes: 1
