How to force Caffe to read all training data?

I'm using Caffe and still having trouble with data input.

Here is my solver definition, auto_solver.prototxt:

train_net: "auto_train.prototxt"
test_net: "auto_test.prototxt"
test_iter: 800
test_interval: 20
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
lr_policy: "inv"
gamma: 0.0001
power: 0.75
display: 100
max_iter: 10000
snapshot: 5000
snapshot_prefix: "sed"
solver_mode: GPU
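
Note that test_iter times the test net's batch size is the number of examples evaluated in each test phase. A quick check with my numbers (the test batch size of 100 is set in the script below):

test_iter = 800
test_batch_size = 100              # test net batch size, set in the script below
print test_iter * test_batch_size  # 80000 examples evaluated per test phase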

And this is the Python script that runs training:

import os

PROJECT_HOME = '/home/romulus/code/project/'
CAFFE_HOME = '/home/romulus/code/caffe/'
os.chdir(PROJECT_HOME)

import sys
sys.path.insert(0, CAFFE_HOME + 'caffe/python')
import caffe
from pylab import *
from caffe import layers as L, params as P

OUTPUT_DIM = 8

def net(db, batch_size):
    n = caffe.NetSpec()
    n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LEVELDB, source=db,
                             transform_param=dict(scale=1./255), ntop=2)
    n.ip1 = L.InnerProduct(n.data, num_output=500, weight_filler=dict(type='xavier'))
    n.relu1 = L.ReLU(n.ip1, in_place=True)
    n.ip2 = L.InnerProduct(n.relu1, num_output=500, weight_filler=dict(type='xavier'))
    n.relu2 = L.ReLU(n.ip2, in_place=True)
    n.ip3 = L.InnerProduct(n.relu2, num_output=OUTPUT_DIM, weight_filler=dict(type='xavier'))
    n.loss = L.SoftmaxWithLoss(n.ip3, n.label)

    return n.to_proto()


with open(PROJECT_HOME + 'auto_train.prototxt', 'w') as f:
    f.write(str(net(PROJECT_HOME + 'traindb', 64)))
with open(PROJECT_HOME + 'auto_test.prototxt', 'w') as f:
    f.write(str(net(PROJECT_HOME + 'testdb', 100)))

caffe.set_device(0)
caffe.set_mode_gpu()
solver = caffe.SGDSolver(PROJECT_HOME + 'auto_solver.prototxt')

solver.net.forward()  # train net
solver.test_nets[0].forward()  # test net (there can be more than one)

niter = 500
test_interval = 15
train_loss = zeros(niter)
test_acc = zeros(int(np.ceil(niter * 1.0 / test_interval)))
output = zeros((niter, 8, OUTPUT_DIM))

for it in range(niter):
    solver.step(1)  # SGD by Caffe

    # record the training loss
    train_loss[it] = solver.net.blobs['loss'].data
    # run the test net from ip1 on, reusing the batch already loaded in its data blob
    solver.test_nets[0].forward(start='ip1')
    output[it] = solver.test_nets[0].blobs['ip3'].data[:8]

    if it % test_interval == 0:
        print 'Iteration', it, 'testing...'
        correct = 0

        for test_it in range(1):  # a single forward pass of the test net
            solver.test_nets[0].forward()
            correct += sum(solver.test_nets[0].blobs['ip3'].data.argmax(1)
                           == solver.test_nets[0].blobs['label'].data)
        # one pass over a test batch of 100 examples
        test_acc[it // test_interval] = correct * 1.0 / 100

fig, ax1 = subplots()
ax2 = ax1.twinx()
ax1.plot(arange(niter), train_loss)
ax2.plot(test_interval * arange(len(test_acc)), test_acc, 'r')
ax1.set_xlabel('iteration')
ax1.set_ylabel('train loss')
ax2.set_ylabel('test accuracy')
fig.savefig('converge.png')

The data is generated manually. Each datum is a 1x256 vector whose elements all equal 8 * the label value; for example, a datum with label 3 is [24, 24, 24, ..., 24, 24]. There are 8 labels and 80000 data points in total.
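
For reference, here is a minimal sketch of how such a LevelDB can be written (assuming the py-leveldb package and Caffe's Datum protobuf; the path and key format are illustrative, not my exact code):

import leveldb
import numpy as np
from caffe.proto import caffe_pb2

db = leveldb.LevelDB('/home/romulus/code/project/traindb')  # illustrative path
for i in range(80000):
    label = i % 8                     # interleaved label order: 0,1,...,7,0,1,...
    datum = caffe_pb2.Datum()
    datum.channels, datum.height, datum.width = 1, 1, 256
    datum.label = label
    # every element of the 1x256 vector equals 8 * label
    datum.data = np.full((1, 1, 256), 8 * label, dtype=np.uint8).tobytes()
    db.Put('%08d' % i, datum.SerializeToString())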

My problem is: if I put the data into LevelDB in interleaved label order, 0,1,2,3,4,5,6,7,0,1,2,3,4,5,..., Caffe trains the network well. But if the data is sorted by label, 0,0,...,0,0,1,1,1,...,1,1,2,2,..., Caffe fails to learn. If I also reduce test_iter in solver.prototxt to 100, the network always predicts label 0.

It seems like Caffe does not read all the training data, only the examples near the front of the database, but I can't find any documentation describing such behavior beyond the training batch size.

In fact, if I increase the training batch size to 80000, Caffe trains well again, though it is very slow, and a batch of the entire dataset is hardly mini-batch training.

Can anyone help? Thank you!

Upvotes: 2

Views: 1186

Answers (1)

Shai

Reputation: 114976

It is always good practice to input the data in a randomized order: if your data is input in a "sorted" way, the gradients will take very degenerate directions per batch, yielding poor training results.
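
For example, one simple fix is to shuffle the write order once before building the LevelDB (a minimal sketch, assuming numpy and the same writer loop sketched in the question):

import numpy as np

labels = np.repeat(np.arange(8), 10000)  # 80000 labels in sorted order: 0,0,...,7,7
np.random.shuffle(labels)                # random permutation for the DB write order
# now write the datums to LevelDB in this shuffled order, as in the question's sketch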

The number of training examples Caffe "sees" during training is max_iter * batch_size, so if you set these two parameters such that their product exceeds the number of training examples you have, you will cover all of your data during training.
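
With the numbers in the question this is easy to check (assuming all 80000 examples are in the training DB):

max_iter = 10000             # from the solver
batch_size = 64              # training batch size from the script
print max_iter * batch_size  # 640000 examples seen = 8 full passes over 80000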

Upvotes: 3
