model.evaluate() changes results depending on batch size, when fed by generator

Question

I'm working in Colab with the default TensorFlow and Keras versions (which print tensorflow 2.2.0-rc2, keras 2.3.0-tf).

I'm seeing a very strange problem: the results of model.evaluate() depend on the batch size I use, and they change again after I shuffle the data, which makes no sense to me. I've been able to reproduce this in a minimal working example. In my full program (which works in 3D with bigger datasets) the variations are even larger. I don't know whether this is related to batch normalization, but I'd expect batch normalization to be fixed when I'm predicting.

My full program does multiclass segmentation. The minimal example takes a black image with a white square at a random position, adds a little noise, and tries to segment the same white square out of it. I'm using Keras Sequence objects as generators to feed data to the model, which might be relevant, since I don't see this behaviour when I evaluate on the arrays directly. Here's the code with its output:

#environment setup
%tensorflow_version 2.x
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input,Conv2D, Activation, BatchNormalization 
from tensorflow.keras import metrics

#set up a toy model
K.set_image_data_format("channels_last")
inputL = Input([64,64,1])
l1 = Conv2D(4,[3,3],padding='same')(inputL)
l1N = BatchNormalization(axis=-1,momentum=0.9) (l1)
l2 = Activation('relu') (l1N)
l3 = Conv2D(32,[3,3],padding='same')(l2)
l3N = BatchNormalization(axis=-1,momentum=0.9) (l3)
l4 = Activation('relu') (l3N)
l5 = Conv2D(1,[1,1],padding='same',dtype='float32')(l4)
l6 = Activation('sigmoid') (l5)
model = Model(inputs=inputL,outputs=l6)
model.compile(optimizer='sgd',loss='mse',metrics=['accuracy'])

#Create random images
import numpy as np
import random
X_train = np.zeros([96,64,64,1])
for imIdx in range(96):
  centPoin = random.randrange(7,50)
  X_train[imIdx,centPoin-5:centPoin+5,centPoin-5:centPoin+5,0]=1

X_val = X_train[:32,:,:,:]
X_train = X_train[32:,:,:,:]
Y_train = X_train.copy()
X_train = np.random.normal(0.,0.1,size=X_train.shape)+X_train
for imIdx in range(64):
  X_train[imIdx,:,:,:] = X_train[imIdx,:,:,:]+np.random.normal(0,0.2,size=1)

from tensorflow.keras.utils import Sequence
import tensorflow as tf

#setup the data generator
class dataGen (Sequence):
  def __init__ (self,x_set,y_set,batch_size):
    self.x, self.y = x_set, y_set
    self.batch_size = batch_size
    nSamples = self.x.shape[0]
    patList = np.array(range(nSamples),dtype='int16')
    patList = patList.reshape(nSamples,1)
    np.random.shuffle(patList)
    self.patList = patList

  def __len__ (self):
    return round(self.patList.shape[0] / self.batch_size)

  def __getitem__ (self, idx):
    patStart = idx
    batchS = self.batch_size
    listLen = self.patList.shape[0]
    Xout = np.zeros((batchS,64,64,1))
    Yout = np.zeros((batchS,64,64,1))    
    for patIdx in range(batchS):
       curPat = (patStart+patIdx) % listLen
       patInd = self.patList[curPat]
       Xout[patIdx,:,:] = self.x[patInd,:,:,:]
       Yout[patIdx,:,:] = self.y[patInd,:,:,:]
    return Xout, Yout

  def on_epoch_end(self):
    np.random.shuffle(self.patList)

  def setBatchSize(self,batchS):
    self.batch_size = batchS

#load the data in the generator
trainGen = dataGen(X_train,Y_train,16)
valGen = dataGen(X_val,X_val,16)

# train the model only briefly (32 epochs here), so that the loss is still bad
trainSteps = len(trainGen)
model.fit(trainGen,steps_per_epoch=trainSteps,epochs=32,validation_data=valGen,validation_steps=len(valGen))

trainGen.setBatchSize(4)
model.evaluate(trainGen)
[0.16259156167507172, 0.9870567321777344]

trainGen.setBatchSize(16)
model.evaluate(trainGen)
[0.17035068571567535, 0.9617958068847656]

trainGen.on_epoch_end()
trainGen.setBatchSize(16)
model.evaluate(trainGen)
[0.16663715243339539, 0.9710426330566406]

If I call model.evaluate(X_train,Y_train,batch_size=16) instead, the result does not depend on the batch size. If I train the model until convergence, where the loss gets to 0.05, the same thing still happens, with the accuracy fluctuating between 0.95 and 0.99 from one evaluation to the next. Why would this happen? I'd expect the prediction to be super easy; am I wrong?

woebs · Accepted Answer

You made a small mistake inside the __getitem__ function.

curPat = (patStart+patIdx) % listLen

should be changed to

curPat = (patStart*batchS+patIdx) % listLen

patStart is equal to idx, the current batch number. If your data set contains 64 samples and your batch size is set to 16, the possible values for idx are 0, 1, 2 and 3.

curPat, on the other hand, is the position of the current sample in the shuffled list of sample numbers, so it should be able to take on all values from 0 to 63. In your code that is not the case: with 64 samples and a batch size of 16, patStart+patIdx only ranges from 0 to 18, so each pass over the generator reads just the first 19 entries of the shuffled list, some of them several times. Multiplying the batch number by the batch size fixes this.
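To make the difference concrete, here is a minimal sketch in plain Python that lists which positions of the shuffled list each indexing scheme visits. The helper names buggy_positions and fixed_positions are made up for this illustration; the number of batches is computed with round(), as in the question's __len__, and the 64 samples match the question's training set.

```python
def buggy_positions(n_samples, batch_size):
    """Positions visited by the original indexing: idx + patIdx."""
    n_batches = round(n_samples / batch_size)  # mirrors __len__ in the question
    return [(idx + k) % n_samples
            for idx in range(n_batches)
            for k in range(batch_size)]


def fixed_positions(n_samples, batch_size):
    """Positions visited by the corrected indexing: idx * batch_size + patIdx."""
    n_batches = round(n_samples / batch_size)
    return [(idx * batch_size + k) % n_samples
            for idx in range(n_batches)
            for k in range(batch_size)]


for bs in (4, 16):
    n_buggy = len(set(buggy_positions(64, bs)))
    n_fixed = len(set(fixed_positions(64, bs)))
    print(f"batch size {bs}: buggy visits {n_buggy} distinct positions, "
          f"fixed visits {n_fixed}")
```

For both batch sizes the original indexing touches only 19 of the 64 positions, while the corrected one touches all 64 exactly once. Since the evaluation then only covers whichever samples sit at the front of the shuffled list, it is plausible that the reported loss and accuracy move whenever that list is reshuffled.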
