Junge

Reputation: 457

caffe: "Check failed: status == CUDNN_STATUS_SUCCESS (3 vs. 0) CUDNN_STATUS_BAD_PARAM" during training

I am getting into programming networks with Caffe, and since I am used to more comfortable and "lazy" solutions, I am a bit overwhelmed by the problems that can occur.

Right now I am getting the error Check failed: status == CUDNN_STATUS_SUCCESS (3 vs. 0) CUDNN_STATUS_BAD_PARAM

This error is quite well known to be produced by incompatible CUDA or cuDNN versions, so I checked those, and they are up to date (CUDA 8.0.61, cuDNN 6.0.21).

Since I only get this error when I add the following ReLU layer, I suppose I have misconfigured a parameter:

layer {
  name: "relu1"
  type: "ReLU"
  bottom: "pool1"
  top: "relu1"
}

And to give you all the information, here is the error message I get:

I0319 09:41:09.484148  6909 solver.cpp:44] Initializing solver from parameters:
test_iter: 10
test_interval: 1000
base_lr: 0.001
display: 20
max_iter: 800
lr_policy: "step"
gamma: 0.1
momentum: 0.9
weight_decay: 0.04
stepsize: 200
snapshot: 10000
snapshot_prefix: "models/train"
solver_mode: GPU
net: "train_val.prototxt"
I0319 09:41:09.484392  6909 solver.cpp:87] Creating training net from net file: train_val.prototxt
I0319 09:41:09.485164  6909 net.cpp:294] The NetState phase (0) differed from the phase (1) specified by a rule in layer feed2
I0319 09:41:09.485183  6909 net.cpp:51] Initializing net from parameters:
name: "CaffeNet"
state {
  phase: TRAIN
}
layer {
  name: "feed"
  type: "HDF5Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  hdf5_data_param {
    source: "train_h5_list.txt"
    batch_size: 50
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 1
    kernel_size: 3
    stride: 1
    weight_filler {
      type: "gaussian"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 1
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "pool1"
  top: "relu1"
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "relu1"
  top: "conv2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 1
    kernel_size: 3
    stride: 1
    weight_filler {
      type: "gaussian"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "conv2"
  top: "ip2"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  inner_product_param {
    num_output: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "sig1"
  type: "Sigmoid"
  bottom: "ip2"
  top: "sig1"
}
layer {
  name: "loss"
  type: "EuclideanLoss"
  bottom: "sig1"
  bottom: "label"
  top: "loss"
}
I0319 09:41:09.485752  6909 layer_factory.hpp:77] Creating layer feed
I0319 09:41:09.485780  6909 net.cpp:84] Creating Layer feed
I0319 09:41:09.485792  6909 net.cpp:380] feed -> data
I0319 09:41:09.485819  6909 net.cpp:380] feed -> label
I0319 09:41:09.485836  6909 hdf5_data_layer.cpp:80] Loading list of HDF5 filenames from: train_h5_list.txt
I0319 09:41:09.485860  6909 hdf5_data_layer.cpp:94] Number of HDF5 files: 1
I0319 09:41:09.486469  6909 hdf5.cpp:32] Datatype class: H5T_FLOAT
I0319 09:41:09.500986  6909 net.cpp:122] Setting up feed
I0319 09:41:09.501011  6909 net.cpp:129] Top shape: 50 227 227 3 (7729350)
I0319 09:41:09.501027  6909 net.cpp:129] Top shape: 50 1 (50)
I0319 09:41:09.501039  6909 net.cpp:137] Memory required for data: 30917600
I0319 09:41:09.501051  6909 layer_factory.hpp:77] Creating layer conv1
I0319 09:41:09.501080  6909 net.cpp:84] Creating Layer conv1
I0319 09:41:09.501087  6909 net.cpp:406] conv1 <- data
I0319 09:41:09.501101  6909 net.cpp:380] conv1 -> conv1
I0319 09:41:09.880740  6909 net.cpp:122] Setting up conv1
I0319 09:41:09.880765  6909 net.cpp:129] Top shape: 50 1 225 1 (11250)
I0319 09:41:09.880781  6909 net.cpp:137] Memory required for data: 30962600
I0319 09:41:09.880808  6909 layer_factory.hpp:77] Creating layer pool1
I0319 09:41:09.880836  6909 net.cpp:84] Creating Layer pool1
I0319 09:41:09.880846  6909 net.cpp:406] pool1 <- conv1
I0319 09:41:09.880861  6909 net.cpp:380] pool1 -> pool1
I0319 09:41:09.880888  6909 net.cpp:122] Setting up pool1
I0319 09:41:09.880899  6909 net.cpp:129] Top shape: 50 1 224 0 (0)
I0319 09:41:09.880913  6909 net.cpp:137] Memory required for data: 30962600
I0319 09:41:09.880921  6909 layer_factory.hpp:77] Creating layer relu1
I0319 09:41:09.880934  6909 net.cpp:84] Creating Layer relu1
I0319 09:41:09.880941  6909 net.cpp:406] relu1 <- pool1
I0319 09:41:09.880952  6909 net.cpp:380] relu1 -> relu1
F0319 09:41:09.881192  6909 cudnn.hpp:80] Check failed: status == CUDNN_STATUS_SUCCESS (3 vs. 0)  CUDNN_STATUS_BAD_PARAM

EDIT: I tried setting the solver mode to CPU; I still get this error.

Upvotes: 1

Views: 2811

Answers (2)

ObnoxiousPlum

Reputation: 26

The reason it is throwing this error is that you have no more room to "shrink". Look at this line from your log: Top shape: 50 1 224 0 (0). The trailing 0 means the blob produced by pool1 has a zero-sized dimension.

To fix this error, you can tweak some of the parameters, including (S)tride, (K)ernel size, and (P)adding. To calculate the dimensions of your next layer (W_new), you can use the formula:

W_new = (W_old - K + 2*P)/S + 1

So, if we have an input that is 227x227x3 and our first layer has K = 5, S = 2, P = 1, and numOutputs = N, then conv1 has an output of size:

(227 - 5 + 2*1)/2 + 1 = 113, i.e. 113x113xN.

Note: if the division is not exact, Caffe floors the quotient for convolution layers (and ceils it for pooling layers) before adding 1.
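
To make the arithmetic concrete, here is a minimal Python sketch of the formula applied to the shapes from the question's log (the helper out_dim is my own name, not a Caffe API; every stride in this net is 1, so the floor/ceil distinction never kicks in here):

def out_dim(w_old, k, s=1, p=0):
    """Spatial output size of a conv/pool layer: (W - K + 2P)/S + 1."""
    return (w_old - k + 2 * p) // s + 1

# The HDF5 blob is read as N=50, C=227, H=227, W=3, so the "width"
# axis is only 3 to begin with:
h, w = 227, 3
h, w = out_dim(h, 3), out_dim(w, 3)  # conv1, K=3, S=1 -> 225 x 1
h, w = out_dim(h, 2), out_dim(w, 2)  # pool1, K=2, S=1 -> 224 x 0
print(h, w)  # 224 0 -- the zero dimension that cuDNN rejects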

Edit: The reason it shows up at the ReLU layer is likely that the ReLU layer receives the zero-sized blob from pool1 and has nothing to pass through; cuDNN rejects a tensor descriptor with a 0 dimension, hence CUDNN_STATUS_BAD_PARAM.

Upvotes: 1

Junge

Reputation: 457

I found out one of the problems.

I0319 09:41:09.880765  6909 net.cpp:129] Top shape: 50 1 225 1 (11250)
I0319 09:41:09.880781  6909 net.cpp:137] Memory required for data: 30962600
I0319 09:41:09.880808  6909 layer_factory.hpp:77] Creating layer pool1
I0319 09:41:09.880836  6909 net.cpp:84] Creating Layer pool1
I0319 09:41:09.880846  6909 net.cpp:406] pool1 <- conv1
I0319 09:41:09.880861  6909 net.cpp:380] pool1 -> pool1
I0319 09:41:09.880888  6909 net.cpp:122] Setting up pool1
I0319 09:41:09.880899  6909 net.cpp:129] Top shape: 50 1 224 0 (0)

As you can see, the first convolutional layer takes an input of size (50 227 227 3), which is a bit problematic, since Caffe assumes the second dimension holds the channels (the expected layout is N x C x H x W).

It's only natural that this convolutional layer butchers the dimensions that way, and no layer after it gets proper input dimensions.

I managed to solve the problem by simply reshaping the input this way:

layer {
  name: "reshape"
  type: "Reshape"
  bottom: "data"
  top: "res"
  reshape_param {
    shape {
      dim: 50
      dim: 3
      dim: 227
      dim: 227
    }
  }
}

The first dimension here is the batch size, so whoever reads this has to remember to set that dim to 1 in the .prototxt file for the classification phase (since that one won't work with batches). Alternatively, Caffe's Reshape layer accepts dim: 0 to copy a dimension straight from the bottom blob, which avoids hardcoding the batch size.
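
One caveat the above does not cover: a Reshape layer only reinterprets the existing memory layout, it does not transpose the data. If the HDF5 file was really written in N x H x W x C order, the reshaped blob will have its pixel values scrambled across channels. Here is a minimal sketch of the alternative fix, transposing the arrays before writing the HDF5 file (assuming Python with h5py and NumPy; the file name train.h5 is just an example, and the dataset names must match the tops of the HDF5Data layer):

import h5py
import numpy as np

# Example arrays in the problematic N x H x W x C layout
# (shapes taken from the question's log).
images = np.zeros((50, 227, 227, 3), dtype=np.float32)
labels = np.zeros((50, 1), dtype=np.float32)

with h5py.File("train.h5", "w") as f:
    # Transpose to the N x C x H x W layout Caffe expects.
    f.create_dataset("data", data=images.transpose(0, 3, 1, 2))
    f.create_dataset("label", data=labels)

With the data stored this way, no Reshape layer would be needed at all.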

EDIT: I will mark this as the answer since it covers the basic solution to the problem I had and no other solution is in sight. If anyone wants to shed more light on the matter, please do so.

Upvotes: 2
