How to force Caffe to read all training data?

I'm using Caffe and still having trouble with data input.

Here is my solver definition, auto_solver.prototxt:

train_net: "auto_train.prototxt"
test_net: "auto_test.prototxt"
test_iter: 800
test_interval: 20
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
lr_policy: "inv"
gamma: 0.0001
power: 0.75
display: 100
max_iter: 10000
snapshot: 5000
snapshot_prefix: "sed"
solver_mode: GPU
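
Note that test_iter times the test net's batch size is the number of examples evaluated in each test phase. A quick check with my numbers (the test batch size of 100 is set in the script below):

test_iter = 800
test_batch_size = 100              # test net batch size, set in the script below
print test_iter * test_batch_size  # 80000 examples evaluated per test phase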

And this is the Python script that runs training:

import os

PROJECT_HOME = '/home/romulus/code/project/'
CAFFE_HOME = '/home/romulus/code/caffe/'
os.chdir(PROJECT_HOME)

import sys
sys.path.insert(0, CAFFE_HOME + 'caffe/python')
import caffe
from pylab import *
from caffe import layers as L, params as P

OUTPUT_DIM = 8

def net(db, batch_size):
    n = caffe.NetSpec()
    n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LEVELDB, source=db,
                             transform_param=dict(scale=1./255), ntop=2)
    n.ip1 = L.InnerProduct(n.data, num_output=500, weight_filler=dict(type='xavier'))
    n.relu1 = L.ReLU(n.ip1, in_place=True)
    n.ip2 = L.InnerProduct(n.relu1, num_output=500, weight_filler=dict(type='xavier'))
    n.relu2 = L.ReLU(n.ip2, in_place=True)
    n.ip3 = L.InnerProduct(n.relu2, num_output=OUTPUT_DIM, weight_filler=dict(type='xavier'))
    n.loss = L.SoftmaxWithLoss(n.ip3, n.label)

    return n.to_proto()


with open(PROJECT_HOME + 'auto_train.prototxt', 'w') as f:
    f.write(str(net(PROJECT_HOME + 'traindb', 64)))
with open(PROJECT_HOME + 'auto_test.prototxt', 'w') as f:
    f.write(str(net(PROJECT_HOME + 'testdb', 100)))

caffe.set_device(0)
caffe.set_mode_gpu()
solver = caffe.SGDSolver(PROJECT_HOME + 'auto_solver.prototxt')

solver.net.forward()  # train net
solver.test_nets[0].forward()  # test net (there can be more than one)

niter = 500
test_interval = 15
train_loss = zeros(niter)
test_acc = zeros(int(np.ceil(niter * 1.0 / test_interval)))
output = zeros((niter, 8, OUTPUT_DIM))

for it in range(niter):
    solver.step(1)  # SGD by Caffe

    # record the training loss
    train_loss[it] = solver.net.blobs['loss'].data
    # run the test net from ip1 on, reusing the batch already loaded in its data blob
    solver.test_nets[0].forward(start='ip1')
    output[it] = solver.test_nets[0].blobs['ip3'].data[:8]

    if it % test_interval == 0:
        print 'Iteration', it, 'testing...'
        correct = 0

        for test_it in range(1):  # a single forward pass of the test net
            solver.test_nets[0].forward()
            correct += sum(solver.test_nets[0].blobs['ip3'].data.argmax(1)
                           == solver.test_nets[0].blobs['label'].data)
        # one pass over a test batch of 100 examples
        test_acc[it // test_interval] = correct * 1.0 / 100

fig, ax1 = subplots()
ax2 = ax1.twinx()
ax1.plot(arange(niter), train_loss)
ax2.plot(test_interval * arange(len(test_acc)), test_acc, 'r')
ax1.set_xlabel('iteration')
ax1.set_ylabel('train loss')
ax2.set_ylabel('test accuracy')
fig.savefig('converge.png')

The data is generated manually. Each datum is a 1x256 vector whose elements all equal 8 * the label value; for example, a datum with label 3 is [24, 24, 24, ..., 24, 24]. There are 8 labels and 80000 data points in total.
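
For reference, here is a minimal sketch of how such a LevelDB can be written (assuming the py-leveldb package and Caffe's Datum protobuf; the path and key format are illustrative, not my exact code):

import leveldb
import numpy as np
from caffe.proto import caffe_pb2

db = leveldb.LevelDB('/home/romulus/code/project/traindb')  # illustrative path
for i in range(80000):
    label = i % 8                     # interleaved label order: 0,1,...,7,0,1,...
    datum = caffe_pb2.Datum()
    datum.channels, datum.height, datum.width = 1, 1, 256
    datum.label = label
    # every element of the 1x256 vector equals 8 * label
    datum.data = np.full((1, 1, 256), 8 * label, dtype=np.uint8).tobytes()
    db.Put('%08d' % i, datum.SerializeToString())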

My problem is: if I put the data into LevelDB in interleaved label order, 0,1,2,3,4,5,6,7,0,1,2,3,4,5,..., Caffe trains the network well. But if the data is sorted by label, 0,0,...,0,0,1,1,1,...,1,1,2,2,..., Caffe fails to learn. If I also reduce test_iter in solver.prototxt to 100, the network always predicts label 0.

It seems like Caffe does not read all the training data, only the examples near the front of the database, but I can't find any documentation describing such behavior beyond the training batch size.

In fact, if I increase the training batch size to 80000, Caffe trains well again, though it is very slow, and a batch of the entire dataset is hardly mini-batch training.

Can anyone help? Thank you!

Upvotes: 2

Views: 1186

Answers (1)

Shai

Reputation: 114976

It is always good practice to input the data in a randomized order: if your data is input in a "sorted" way, the gradients will take very degenerate directions per batch, yielding poor training results.
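
For example, one simple fix is to shuffle the write order once before building the LevelDB (a minimal sketch, assuming numpy and the same writer loop sketched in the question):

import numpy as np

labels = np.repeat(np.arange(8), 10000)  # 80000 labels in sorted order: 0,0,...,7,7
np.random.shuffle(labels)                # random permutation for the DB write order
# now write the datums to LevelDB in this shuffled order, as in the question's sketch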

The number of training examples Caffe "sees" during training is max_iter * batch_size, so if you set these two parameters such that their product exceeds the number of training examples you have, you will cover all of your data during training.
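
With the numbers in the question this is easy to check (assuming all 80000 examples are in the training DB):

max_iter = 10000             # from the solver
batch_size = 64              # training batch size from the script
print max_iter * batch_size  # 640000 examples seen = 8 full passes over 80000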

Upvotes: 3
