FHoenig

Reputation: 359

Difference between surface memory instructions on Kepler vs Maxwell

Given the following low-level (SASS) instructions on the latest two generations of NVIDIA GPUs (ref http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html), what are the (perhaps speculated) differences in the hardware / memory hierarchy design, and what are the performance implications?

Surface Memory Instructions MAXWELL

SUATOM  Surface Reduction
SULD    Surface Load
SURED   Atomic Reduction on surface memory
SUST    Surface Store

Surface Memory Instructions KEPLER

SUCLAMP Surface Clamp
SUBFM   Surface Bit Field Merge
SUEAU   Surface Effective Address
SULDGA  Surface Load Generic Address
SUSTGA  Surface Store Generic Address

Upvotes: 0

Views: 481

Answers (1)

ArchaeaSoftware

Reputation: 4422

CUDA arrays wrap NVIDIA's proprietary array layouts, which are optimized for 2D and 3D locality. The translation from coordinates to an address is intentionally hidden from developers, since it may change from one architecture to the next. It looks like NVIDIA chose to expose this translation differently on Kepler than on Maxwell, with Kepler taking a more "RISC-like" approach. The SASS disassembly of the surf2Dmemset sample from the CUDA Handbook (https://github.com/ArchaeaSoftware/cudahandbook/blob/master/texturing/surf2Dmemset.cu) shows six instructions to write the output on Kepler:

 SUCLAMP PT, R8, R7, c[0x0][0x164], 0x0;
 SUCLAMP.SD.R4 PT, R6, R6, c[0x0][0x15c], 0x0;
 IMADSP.SD R9, R8, c[0x0][0x160], R6;
 SUBFM P0, R8, R6, R8, R9;
 SUEAU R9, R9, R8, c[0x0][0x154];
 SUSTGA.B.32.TRAP.U8 [R8], c[0x0][0x158], R10, P0;

as compared to one for Maxwell:

 SUST.D.BA.2D.TRAP [R2], R8, 0x55;
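For context, the kind of source that produces these stores is a single surf2Dwrite call per element. The following is a minimal sketch, not the CUDA Handbook sample itself: it assumes the surface object API (cudaSurfaceObject_t) rather than whatever API the sample uses, and the names surf2DFillKernel and fillAndCheck are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: writes `value` to every element of a width x height
// 2D surface. The single surf2Dwrite call is the statement that the compiler
// lowers to the surface store sequence.
__global__ void surf2DFillKernel(cudaSurfaceObject_t surf, float value,
                                 int width, int height)
{
    for (int y = blockIdx.y * blockDim.y + threadIdx.y; y < height;
         y += blockDim.y * gridDim.y) {
        for (int x = blockIdx.x * blockDim.x + threadIdx.x; x < width;
             x += blockDim.x * gridDim.x) {
            // The x coordinate is a byte offset; y is a row index.
            // The default boundary mode is cudaBoundaryModeTrap.
            surf2Dwrite(value, surf, x * (int)sizeof(float), y);
        }
    }
}

// Hypothetical host helper: allocates a CUDA array, fills it through a
// surface object, copies it back, and checks every element.
bool fillAndCheck(int width, int height, float value)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t array = nullptr;
    if (cudaMallocArray(&array, &desc, width, height,
                        cudaArraySurfaceLoadStore) != cudaSuccess) {
        return false;
    }

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = array;
    cudaSurfaceObject_t surf = 0;
    if (cudaCreateSurfaceObject(&surf, &res) != cudaSuccess) {
        cudaFreeArray(array);
        return false;
    }

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    surf2DFillKernel<<<grid, block>>>(surf, value, width, height);

    // Blocking copy; it also waits for the kernel on the default stream.
    std::vector<float> host(static_cast<size_t>(width) * height);
    cudaError_t err = cudaMemcpy2DFromArray(host.data(), width * sizeof(float),
                                            array, 0, 0,
                                            width * sizeof(float), height,
                                            cudaMemcpyDeviceToHost);
    cudaDestroySurfaceObject(surf);
    cudaFreeArray(array);
    if (err != cudaSuccess) return false;

    for (float v : host) {
        if (v != value) return false;
    }
    return true;
}
```

Calling fillAndCheck(64, 32, 3.0f) from a host program should return true on a working device. Disassembling the resulting binary with cuobjdump -sass for different target architectures is one way to compare the store sequences yourself; the exact instructions will depend on the toolkit version and target.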

The "EA" in the Kepler instructions stands for "effective address"; it is a more complicated variant of the LEA (load effective address) instruction found in CISC instruction sets.

As for SURED/SUATOM, those must be the surface equivalents of GRED/GATOM. Both perform atomic operations, but the ATOM variants return the previous value of the memory location and the RED variants do not. They don't need different intrinsics; the compiler emits the appropriate instruction automatically, depending on whether the return value is used.
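The ATOM/RED distinction can be illustrated with ordinary global-memory atomics, used here as an analogy rather than as surface code. This is a minimal sketch: the kernel and helper names are hypothetical, and which instruction the compiler actually emits for each kernel is an expectation to confirm in the disassembly, not a guarantee.

```cuda
#include <cuda_runtime.h>

// Return value discarded: the compiler is free to emit a reduction
// (RED-style) instruction, since nothing reads the old value.
__global__ void countOnly(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

// Return value consumed: the old value is needed, so an ATOM-style
// instruction that returns it is required.
__global__ void countAndRecord(unsigned int *counter, unsigned int *slots)
{
    unsigned int old = atomicAdd(counter, 1u);
    slots[old] = blockIdx.x * blockDim.x + threadIdx.x;
}

// Hypothetical host helper: runs both kernels with the same launch shape
// and returns the sum of the two final counter values.
unsigned int runBoth(int blocks, int threads)
{
    const unsigned int n = static_cast<unsigned int>(blocks) * threads;
    unsigned int *counter = nullptr;
    unsigned int *slots = nullptr;
    unsigned int first = 0, second = 0;

    cudaMalloc(&counter, sizeof(unsigned int));
    cudaMalloc(&slots, n * sizeof(unsigned int));

    cudaMemset(counter, 0, sizeof(unsigned int));
    countOnly<<<blocks, threads>>>(counter);
    cudaMemcpy(&first, counter, sizeof(first), cudaMemcpyDeviceToHost);

    // Reset so that the indices written by countAndRecord stay within slots.
    cudaMemset(counter, 0, sizeof(unsigned int));
    countAndRecord<<<blocks, threads>>>(counter, slots);
    cudaMemcpy(&second, counter, sizeof(second), cudaMemcpyDeviceToHost);

    cudaFree(counter);
    cudaFree(slots);
    return first + second;
}
```

Both kernels call the same atomicAdd intrinsic; only the use of its result differs. Inspecting the two kernels with cuobjdump -sass should show whether the compiler picked a different instruction for each.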

Upvotes: 3
