Reputation: 359
I'm trying to develop a better intuition for how OpenCL's abstractions map onto the actual hardware. For instance, take the configuration of the late-2011 MacBook Pro:
1)
Radeon 6770M GPU: http://www.amd.com/us/products/notebook/graphics/amd-radeon-6000m/amd-radeon-6700m-6600m/Pages/amd-radeon-6700m-6600m.aspx#2
"480 Stream Processors" I guess is the important number there.
2)
On the other hand the OpenCL API gives me these numbers:
DEVICE_NAME = ATI Radeon HD 6770M
DRIVER_VERSION = 1.0
DEVICE_VENDOR = AMD
DEVICE_VERSION = OpenCL 1.1
DEVICE_MAX_COMPUTE_UNITS = 6
DEVICE_MAX_CLOCK_FREQUENCY = 675
DEVICE_GLOBAL_MEM_SIZE = 1073741824
DEVICE_LOCAL_MEM_SIZE = 32768
CL_DEVICE_ADDRESS_BITS = 32
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE = 0
CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE = 0
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE = 65536
CL_DEVICE_MAX_WORK_GROUP_SIZE = 1024
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
CL_DEVICE_MAX_WORK_ITEM_SIZES = (1024, 1024, 1024)
And querying the work-group size and preferred multiple for a trivial kernel (a pass-through of a float4 from input to output global memory) gives:
CL_KERNEL_PREFERRED_WORKGROUP_SIZE_MULTIPLE = 64
CL_KERNEL_WORK_GROUP_SIZE = 256
3)
The OpenCL specification states that an entire work group must be able to run concurrently on a device's compute unit.
4)
OpenCL also gives the device's SIMD width through the preferred work-group size multiple, which is 64 in the above case.
Somehow I cannot relate the "6", the "480" and the powers of two to each other. If the number of compute units is 6 and the SIMD width is 64, I get 6 * 64 = 384, not 480.
Can anybody explain how these numbers relate to each other, and especially to the hardware?
Upvotes: 0
Views: 897
Reputation: 9886
In this GPU, each "compute unit" is a core executing one or more work-groups.
The maximum size of each work-group is 256 for your specific kernel (obtained with clGetKernelWorkGroupInfo). It can be less if your kernel requires more resources (registers, local memory).
In each core, 16 work-items are physically active at a given time and execute the same "large instruction" (see VLIW5), which is mapped onto 5 arithmetic units (ALUs). That gives 5 * 16 = 80 ALUs per core, or 6 * 80 = 480 "stream processors" for the 6 cores.
Work-items are actually executed in blocks of 64 (a "wavefront" in AMD terminology): all 64 work-items execute the same VLIW5 instruction, and they are physically processed in 4 passes of 16. This is why you get a preferred work-group size multiple of 64.
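To make the arithmetic explicit, here is a minimal sketch in plain Python (no OpenCL calls). It only uses the figures quoted in the question and in this answer; the variable names are mine and purely illustrative.

```python
# Figures taken from the question's device query and from the answer text.
COMPUTE_UNITS = 6             # DEVICE_MAX_COMPUTE_UNITS
LANES_PER_CU = 16             # work-items physically active per compute unit
ALUS_PER_LANE = 5             # VLIW5: one instruction maps onto 5 ALUs
WAVEFRONT_SIZE = 64           # preferred work-group size multiple
KERNEL_WORK_GROUP_SIZE = 256  # CL_KERNEL_WORK_GROUP_SIZE for the trivial kernel

# ALUs ("stream processors") per compute unit and for the whole device.
alus_per_cu = LANES_PER_CU * ALUS_PER_LANE
stream_processors = COMPUTE_UNITS * alus_per_cu

# A 64-wide wavefront is issued over 16 lanes in several passes.
passes_per_wavefront = WAVEFRONT_SIZE // LANES_PER_CU

# A full-size work-group for this kernel spans several wavefronts.
wavefronts_per_group = KERNEL_WORK_GROUP_SIZE // WAVEFRONT_SIZE

# The question's product: it counts work-items per wavefront, not ALUs.
units_times_wavefront = COMPUTE_UNITS * WAVEFRONT_SIZE

print("ALUs per compute unit:", alus_per_cu)
print("Stream processors:", stream_processors)
print("Passes per wavefront:", passes_per_wavefront)
print("Wavefronts per work-group:", wavefronts_per_group)
print("compute units * wavefront size:", units_times_wavefront)
```

The 384 from the question multiplies compute units by the logical wavefront width, whereas the marketing figure of 480 counts physical ALUs (16 lanes times 5 ALUs per compute unit), so the two numbers measure different things.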
More recent AMD GPUs have switched to a VLIW4 model, where each instruction maps onto only 4 ALUs.
Upvotes: 1