Reputation: 402812
So I was playing around with pycaffe's convolution function, implemented as part of a basic convolution layer. Here's my convolution.prototxt file:
name: "convolution"
input: "data"
input_dim: 1
input_dim: 1
input_dim: 227
input_dim: 227
layer {
  name: "conv"
  type: "Convolution"
  bottom: "data"
  top: "conv"
  convolution_param {
    num_output: 96
    kernel_size: 11
    stride: 1
  }
}
Those parameters are the same as those of AlexNet's first CONV layer (except the stride, which is actually 4 there).
I have a Macbook Pro with an NVIDIA GeForce GT 650M 1024 MB GPU. I'm not sure it means much, but my laptop also has an Intel HD 4000 as a built-in GPU.
I did a few tests on my laptop while varying the stride hyperparameter, first on GPU mode and then CPU.
1) Varying strides after calling caffe.set_device(0); caffe.set_mode_gpu():
Stride 1: 27.26 ms
Stride 2: 14.27 ms
Stride 3: 10.57 ms
Stride 4: 7.45 ms
2) Varying strides after calling caffe.set_mode_cpu():
Stride 1: 49.77 ms # expected
Stride 2: 9.92 ms # this and the results after this don't make sense
Stride 3: 4.50 ms
Stride 4: 1.96 ms
(Each figure is the mean of 3 runs.)
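The CPU times shrinking with stride is, by itself, what a simple cost model predicts: with no padding (as in the prototxt above), the number of output positions, and hence the multiply-accumulate count, falls roughly quadratically with the stride. A quick sketch (the `conv_cost` helper is just my illustration, not a Caffe API):

```python
def conv_cost(in_dim=227, k=11, num_out=96, in_ch=1, stride=1):
    # Output spatial size with no padding: floor((in_dim - k) / stride) + 1.
    out_dim = (in_dim - k) // stride + 1
    # Multiply-accumulates: one k*k*in_ch dot product per output element.
    macs = num_out * in_ch * k * k * out_dim * out_dim
    return out_dim, macs

base = conv_cost(stride=1)[1]
for s in (1, 2, 3, 4):
    out_dim, macs = conv_cost(stride=s)
    print("stride %d: %dx%d output, %.1f%% of stride-1 MACs"
          % (s, out_dim, out_dim, 100.0 * macs / base))
```

So stride 2 does about a quarter of the work of stride 1, stride 4 about a sixteenth; the puzzle is not that the CPU gets faster with stride, but why the GPU timings don't fall as steeply.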
I'm just trying to understand how Caffe's convolution works based on these tests. Can anyone help me shed light on this? In particular, why does CPU mode execute faster than GPU mode for strides greater than 1?
Test code I used if you're interested in seeing for yourself:
import numpy as np
import caffe
import time

caffe.set_device(0)
caffe.set_mode_gpu()  # caffe.set_mode_cpu()

net = caffe.Net('convolution.prototxt', caffe.TEST)

total = 0.0
for _ in range(3):
    net.blobs['data'].data[...] = np.random.randn(1, 1, 227, 227)  # there really is an ellipsis there
    net.params['conv'][0].data[...] = np.random.randn(96, 1, 11, 11)
    s = time.time()
    r = net.forward()
    e = time.time()
    total += (e - s)
print total / 3 * 1000  # mean forward time in ms (Python 2 print statement)
Upvotes: 2
Views: 308
So, after digging around, I found out that Caffe basically uses extra memory to flatten out local regions into columns (im2col), and then uses level-3 BLAS routines (cblas_sgemm in particular) to carry out a matrix multiplication that produces the result. This gives speedy computation at the cost of extra memory.
References can be found here and here.
Memory operations on GPUs are, in general, much costlier than on CPUs. All the extra memory traffic is a possible explanation for the slowdown encountered when running in GPU mode. This would also depend on the specs of the GPU itself.
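To make the im2col + GEMM idea concrete, here is a minimal NumPy sketch (my own illustration, not Caffe's actual C++ code; the `im2col` and `conv_gemm` helpers are hypothetical): every k-by-k receptive field is copied out into a column of a big matrix, and the whole convolution then collapses into one matrix multiply, which is exactly the shape of work cblas_sgemm is optimized for.

```python
import numpy as np

def im2col(x, k, stride):
    # x: (C, H, W). Copy each k x k receptive field into one column.
    C, H, W = x.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    cols = np.empty((C * k * k, out_h * out_w), dtype=x.dtype)
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i*stride:i*stride+k, j*stride:j*stride+k]
            cols[:, idx] = patch.ravel()
            idx += 1
    return cols, out_h, out_w

def conv_gemm(x, weights, stride=1):
    # weights: (num_output, C, k, k). One big matrix multiply
    # (the cblas_sgemm call in Caffe) replaces the nested conv loops.
    n_out, C, k, _ = weights.shape
    cols, out_h, out_w = im2col(x, k, stride)
    out = np.dot(weights.reshape(n_out, -1), cols)  # (n_out, out_h*out_w)
    return out.reshape(n_out, out_h, out_w)

x = np.random.randn(1, 227, 227).astype(np.float32)
w = np.random.randn(96, 1, 11, 11).astype(np.float32)
print(conv_gemm(x, w, stride=4).shape)  # (96, 55, 55)
```

Note the memory cost: each input pixel is duplicated up to (k/stride)^2 times in the column matrix, which is the "extra memory to flatten out local regions" mentioned above.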
Upvotes: 1