Albert

Reputation: 68240

How much faster is NCHW compared to NHWC in TensorFlow/cuDNN?

The official TensorFlow performance guide states:

Most TensorFlow operations used by a CNN support both NHWC and NCHW data format. On GPU, NCHW is faster. But on CPU, NHWC is sometimes faster.

How much faster is NCHW compared to NHWC in TensorFlow/cuDNN, for convolution? Are there any references or benchmarks for this?

Also, why is it faster? As I understand it (see here), for NHWC on GPU, TensorFlow will internally always transpose to NCHW, then call the cuDNN conv kernel for NCHW, then transpose the result back. But why does it do that? The cuDNN conv kernel also works for NHWC. Maybe at some point they did the comparison and the cuDNN conv kernel for NHWC was very slow. But is that still up to date? And how big was the difference? What are the technical reasons that NHWC is so much slower? Or is the cuDNN kernel for this case just not well optimized?

Upvotes: 17

Views: 16967

Answers (4)

Elinx

Reputation: 1204

CPU side:

Let's assume the inputs and filters are transposed so the convolution becomes a GEMM. For NCHW, the shapes after im2col are W[out_channels, in_channels * filter_height * filter_width] and X[in_channels * filter_height * filter_width, out_height * out_width]; for NHWC, they are X[out_height * out_width, filter_height * filter_width * in_channels] and W[filter_height * filter_width * in_channels, out_channels]. The former computes W*X while the latter computes X*W. As you can see, the only difference is whether out_channels or out_height * out_width comes first, and you can hardly measure any performance difference from that, because GEMM is highly optimized and uses packing and tiling techniques to multiply small matrix patches.
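
A minimal NumPy sketch of the two GEMM formulations above, just to make the shapes concrete (the sizes are arbitrary placeholders and the im2col step is assumed to have been done already):

    import numpy as np

    out_ch, in_ch, kh, kw = 64, 32, 3, 3
    out_h, out_w = 56, 56

    # NCHW: W[out_ch, in_ch*kh*kw] @ X[in_ch*kh*kw, out_h*out_w]
    W_nchw = np.random.rand(out_ch, in_ch * kh * kw)
    X_nchw = np.random.rand(in_ch * kh * kw, out_h * out_w)
    Y_nchw = W_nchw @ X_nchw   # -> [out_ch, out_h*out_w]

    # NHWC: X[out_h*out_w, kh*kw*in_ch] @ W[kh*kw*in_ch, out_ch]
    X_nhwc = np.random.rand(out_h * out_w, kh * kw * in_ch)
    W_nhwc = np.random.rand(kh * kw * in_ch, out_ch)
    Y_nhwc = X_nhwc @ W_nhwc   # -> [out_h*out_w, out_ch]

    # Same amount of work either way; the GEMM itself is equally fast for both.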

The biggest penalty for NCHW comes from im2col: with NHWC you can memcpy the innermost in_channels data in one contiguous run, while NCHW has to jump from row to row and channel to channel to gather a complete patch of data (and this is also what XNNPACK does for its performance improvements).
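
A toy illustration of that access-pattern difference, for a single im2col patch (hypothetical sizes):

    import numpy as np

    H, W, C = 8, 8, 16
    x_nhwc = np.random.rand(H, W, C)
    x_nchw = np.transpose(x_nhwc, (2, 0, 1))   # same data, laid out as [C, H, W]

    h0, w0, kh, kw = 2, 3, 3, 3

    # NHWC: for each of the kh*kw taps, all C channel values are contiguous
    # in memory, so the copy is essentially one memcpy of length C per tap.
    patch_nhwc = x_nhwc[h0:h0 + kh, w0:w0 + kw, :].reshape(-1)

    # NCHW: the C values of one spatial position are H*W elements apart,
    # so gathering the same patch needs strided jumps across channels and rows.
    patch_nchw = x_nchw[:, h0:h0 + kh, w0:w0 + kw].reshape(-1)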

GPU side:

Don't know much about it yet.

Upvotes: 0

stephane.c

Reputation: 186

The reason is that most implementations of simple convolutions (not talking Winograd or FFT here) end up doing some kind of simple matrix multiplication, which means that in their inner loop they multiply some values from both tensors and sum the results.

On a CPU implementation, using SSE or AVX optimization, it's faster to do this along the C dimension, because you just multiply-add the values 4 by 4 or 8 by 8, and then do the reduction (sum your 4 or 8 accumulations) once at the end, after you have accumulated over the whole C dimension.
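
A toy NumPy model of that pattern (not actual SIMD code; the 8-element accumulator just stands in for one AVX register, and the sizes are arbitrary):

    import numpy as np

    C = 64
    x = np.random.rand(C)   # one input position's channel values
    w = np.random.rand(C)   # one filter tap's channel values

    acc = np.zeros(8)                    # stands in for an 8-lane AVX register
    for c in range(0, C, 8):
        acc += x[c:c + 8] * w[c:c + 8]   # one multiply-add per 8 channels
    result = acc.sum()                   # single horizontal reduction at the end

    assert np.isclose(result, np.dot(x, w))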

On a GPU, however, doing a reduction across threads is a more costly operation (at least it was until Kepler introduced warp-level shuffle instructions), so historically it has been optimized so that each thread in a warp reads consecutive (in memory) HW values and does the accumulation over parts of C with a loop.
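
A toy scalar model of that access pattern (hypothetical sizes; think of the 32 output positions as one warp, with each "thread" keeping its own private accumulator so no cross-thread reduction is needed):

    import numpy as np

    C, HW = 16, 32                 # 32 output positions ~ one warp
    x = np.random.rand(C, HW)      # NCHW-style layout: channels outermost
    w = np.random.rand(C)          # one filter tap per channel

    acc = np.zeros(HW)             # one private accumulator per "thread"
    for c in range(C):             # every thread loops over C
        acc += w[c] * x[c, :]      # threads 0..31 read consecutive HW values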

Note though that the latest NVIDIA cards (RTX) now have Tensor Cores, which can process small blocks in one operation, including the reduction over a small portion of C, so on these cards it's actually faster to use NHWC (or hybrid NCHWC formats).

Upvotes: 12

Carl Thomé

Reputation: 2742

I don't think there is much of a point in manually optimising the layout, especially because data_format="channels_first" looks a lot more verbose than sticking with the default throughout TensorFlow, and because the internals should take care of it.

I'd expect at most a couple of percent faster training times with NCHW, and over time I'd expect this performance difference to go away as XLA JIT compilation matures.

With Keras you can switch between the two pretty easily with K.set_image_data_format, so try both and see what difference it makes for your particular model.
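
For example, something along these lines (a minimal tf.keras sketch; the toy model and input sizes are placeholders, so substitute your own model and time model.fit() for each setting):

    import tensorflow as tf
    from tensorflow.keras import backend as K

    for fmt in ("channels_last", "channels_first"):   # NHWC vs NCHW
        K.set_image_data_format(fmt)
        shape = (224, 224, 3) if fmt == "channels_last" else (3, 224, 224)
        model = tf.keras.Sequential([
            tf.keras.Input(shape=shape),
            tf.keras.layers.Conv2D(32, 3, activation="relu"),
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )
        # model.fit(x_train, y_train, ...)  # time this for each data format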

Here is a small benchmark with a VGG model: https://gist.github.com/carlthome/51d62cbf5fc23098418eef93b11a5d78

Upvotes: 1

MWB

Reputation: 12577

As of TF 1.1, you can't even use NHWC directly on the GPU path; TF does the conversion to and from NCHW internally. So, regardless of how efficient the NHWC implementation in cuDNN is, from the TF user's perspective, NCHW is faster:

https://github.com/tensorflow/tensorflow/issues/8286

The performance ratio will of course depend on the problem, but my sense is that it's big, and you don't want to use NHWC on the GPU if you can avoid it (it seems likely that you'd be wasting memory too).

Upvotes: 4
