Reputation: 7273
I am trying to run my program with OpenCL.
I have seen the following information in the log:
OpenCL device #0: GPU NVIDIA Corporation GeForce GT 730 with OpenCL 1.2 (2 units, 901 MHz, 4096 Mb, version 391.35)
OpenCL device #1: GPU NVIDIA Corporation GeForce GT 730 with OpenCL 1.2 (2 units, 901 MHz, 4096 Mb, version 391.35)
OpenCL device #2: CPU Intel(R) Corporation Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz with OpenCL 2.1 (8 units, 4000 MHz, 16300 Mb, version 7.0.0.2567)
What I gather from the above information is that each of my GPU devices has 2 compute units.
After checking the specification of my GPU device with the CudaZ utility, I see that 384 cores are reported for the GPU device at [ PCI_LOC=0:1:0 ].
See the image:
The clinfo output shows the following: gist of clinfo
My question is: when I have 384 cores, why are only 2 units displayed? Secondly, when I have many cores, how does OpenCL distribute the task: does each core run the same process on the same data, or do different cores work on different data?
Upvotes: 0
Views: 858
Reputation: 1
My question is: when I have 384 cores, why are only 2 units displayed?
Easy: GPU computing devices have a different, silicon-hardwired architecture than general-purpose CISC/RISC CPU computing devices. The reason why is important here.
GPU devices use Streaming Multiprocessor eXecution units ( SMX units ), which are what some hardware-inspection tools report as the device's units.
While the letter M in the SMX abbreviation emphasises that multiple execution blocks can be loaded onto the SMX-unit, all such blocks ( if instructed in such a manner, which goes outside the scope of this topic, so as to cover / span each of the SM-cores present on the SMX ) actually execute the very same computing instruction. This is the only way they can operate: a SIMD-type of limited-scope parallelism, achievable ( co-locally ) on the perimeter of the SMX only, where single-instruction-multiple-data execution is carried out within the capabilities of the present SIMD-( WARP-wide | half-WARP-wide )-scheduler.
The 384 cores listed above are a hardware limit, beyond which this co-locally orchestrated SIMD-type of limited-scope parallelism cannot grow, and all attempts in this direction lead to a pure-[SERIAL] internal scheduling of GPU-jobs ( yes, one after another ).
Understanding these basics is cardinal: without these architecture features, one may expect behaviour that is in principle impossible to orchestrate in any kind of GPGPU system of the formal shape [ 1-CPU-host : N-GPU-device(s) ], i.e. a composition of autonomous, asynchronous distributed-system star-of-nodes.
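The "2 units" in the log are simply what the OpenCL driver reports as the device's compute units ( on NVIDIA hardware one compute unit corresponds to one SMX, each holding many SM-cores, hence 2 units yet 384 cores ). A minimal sketch of querying that figure; the helper name is illustrative and a valid cl_device_id is assumed:

#include <stdio.h>
#include <CL/cl.h>

/* Minimal sketch: ask the driver for the figure that the log prints as "units".
 * On NVIDIA GPUs one compute unit corresponds to one SMX, not to one core. */
static void print_compute_units( cl_device_id device )
{
    cl_uint units = 0;
    clGetDeviceInfo( device, CL_DEVICE_MAX_COMPUTE_UNITS,
                     sizeof( units ), &units, NULL );
    printf( "Compute units reported by the driver: %u\n", (unsigned) units );
}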
Any GPU-kernel loaded from the CPU-host onto the GPU gets mapped onto a non-empty set of SMX-unit(s), where a specified number of cores ( another, finer-grained geometry-of-computing-resources is applied, again going way beyond the scope of this post ) is loaded with a stream of SIMD-instructions, without violating the GPU-device limits:
...
+----------------------------------------------------------------------------------------
Max work items dimensions: 3 // 3D-geometry grids possible
Max work items[0]: 1024 // 1st dimension max.
Max work items[1]: 1024
Max work items[2]: 64 // theoretical max. 1024 x 1024 x 64 BUT...
+----------------------------------------------------------------------------------------
Max work group size: 1024 // actual max. "geometry"-size
+----------------------------------------------------------------------------------------
...
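A minimal host-side sketch of keeping a 1-D launch "geometry" within these limits; the names are illustrative, queue, kernel and the problem size N are assumed to exist already, and the kernel is assumed to guard against out-of-range work-item ids:

#include <CL/cl.h>

/* Sketch: pick a local ( work-group ) size within CL_DEVICE_MAX_WORK_GROUP_SIZE
 * and round the global size up to a multiple of it. */
static cl_int launch_1d( cl_device_id device, cl_command_queue queue,
                         cl_kernel kernel, size_t N )
{
    size_t max_wg = 0;                                   /* device limit            */
    clGetDeviceInfo( device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                     sizeof( max_wg ), &max_wg, NULL );

    size_t local  = ( max_wg < 256 ) ? max_wg : 256;     /* stay within the limit   */
    size_t global = ( ( N + local - 1 ) / local ) * local; /* multiple of local     */

    return clEnqueueNDRangeKernel( queue, kernel,
                                   1,                    /* 1-D "geometry"          */
                                   NULL,                 /* no global offset        */
                                   &global, &local,
                                   0, NULL, NULL );
}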
So,
if 1 SM-core is internally instructed to execute some GPU-task unit ( a GPU-job ), just this one SM-core fetches one GPU-RISC-instruction after another ( ignoring any possible ILP for simplicity here ) and executes each one at a time, stepping through the stream of SIMD-instructions of that GPU-job. All the other SM-cores present on the same SMX-unit typically do nothing during that time, until this GPU-job gets finished and the internal GPU-process management system decides to map some other work onto this SMX.
if 2 SM-cores are instructed to execute some GPU-job, just this pair of SM-cores fetches one ( and the very same ) GPU-RISC-instruction after another ( ignoring any possible ILP for simplicity here ) and both execute it one at a time, stepping through the stream of SIMD-instructions of that GPU-job. In this case, if an if-ed, or similarly branched, flow of execution makes one SM-core take a different code-execution path than the other, the SIMD-parallelism enters a divergent scenario: one SM-core receives the next SIMD-instruction of its code-execution path, whereas the other does nothing ( gets GPU_NOP(s) ), until the first one finishes its whole path ( or is forced to stop at some synchronisation barrier, or falls into an unmaskable latency wait-state while waiting for a piece of data to be fetched from a "far" ( slow ) non-local memory location; again, details go way beyond the scope of this post ). Only after one of these happens can the divergent-path, so far just GPU_NOP-ed SM-core receive its next SIMD-instruction, belonging to its ( divergent ) code-execution path, and move forward ( an illustrative kernel showing such divergence is sketched below, after this list ). All the other SM-cores present on the same SMX-unit typically do nothing during that time, until this GPU-job gets finished and the internal GPU-process management system decides to map some other work onto this SMX.
if 16 SM-cores are instructed to execute some GPU-job by the task-specific "geometry", just this "herd" of SM-cores fetches one ( and the very same ) GPU-RISC-instruction after another ( ignoring any possible ILP for simplicity here ) and all execute it one at a time, stepping through the stream of SIMD-instructions of that GPU-job. Any divergence inside the "herd" reduces the SIMD-effect, and GPU_NOP-blocked cores remain waiting for the main part of the "herd" to finish the job ( same as sketched right above ).
Anyway, all the other SM-cores not mapped by the task-specific "geometry" onto the respective GPU-device's SMX-unit typically remain doing nothing useful at all. Knowing the hardware details is therefore essential for choosing a proper task-specific "geometry", and profiling helps to identify the peak performance for any such GPU-task constellation ( differences may span several orders of magnitude, from best to common to worst, among all possible task-specific "geometry" setups ).
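An illustrative OpenCL C kernel ( not from the question ) showing the kind of data-driven branching that makes a "herd" diverge:

/* Illustrative OpenCL C kernel: within one "herd" ( WARP / half-WARP ) the
 * data-driven branch below forces the hardware to issue path A and path B
 * one after another, GPU_NOP-ing the inactive work-items each time. */
__kernel void divergent_example( __global const float *in,
                                 __global       float *out )
{
    const size_t gid = get_global_id( 0 );

    if ( in[gid] > 0.0f )            /* data-driven branch -> possible divergence */
        out[gid] = sqrt( in[gid] );  /* path A                                    */
    else
        out[gid] = 0.0f;             /* path B, serialised against path A         */
}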
Secondly, when I have many cores, how does OpenCL distribute the task: does each core run the same process on the same data, or do different cores work on different data?
As explained briefly above, the SIMD-type device silicon-architecture does not permit any of the SMX SM-cores to execute anything other than the very same SIMD-instruction across the whole "herd"-of-SM-cores that was mapped by a task-"geometry" onto the SMX-unit ( not counting the GPU_NOP(s) as doing "something else", as they just waste CPU:GPU-system time ).
So, yes, " .. the same process on each core .. " ( best if never divergent in its internal code-execution paths after if, while or any other kind of code-execution-path branching ). If an algorithm, based on data-driven values, results in different internal state, each core may carry a different thread-local state, based on which its processing may differ ( as exemplified with the if-driven divergent code-execution paths above ). More details on SM-local registers, SM-local caching, restricted shared-memory usage ( and its latency costs ), and GPU-device global-memory usage ( latency costs, cache-line lengths and associativity for best coalescing access-patterns and latency-masking options ) fill small thousands of pages of hardware- and software-specific documentation and are well beyond the scope of this post, simplified here for clarity.
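A minimal, illustrative OpenCL C kernel of this "same process on each core" pattern, where every work-item steps through the identical instruction stream:

/* Every work-item executes exactly the same instruction stream, but
 * get_global_id(0) makes each one touch a different element - the
 * "same process, different data" case from the question. */
__kernel void vadd( __global const float *a,
                    __global const float *b,
                    __global       float *c )
{
    const size_t gid = get_global_id( 0 );
    c[gid] = a[gid] + b[gid];
}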
" .. the same data, or do different cores work on different data? "
This is the last, but not least, dilemma: any well-parameterised GPU-kernel activation may also pass some amount of external-world data down to the GPU-kernel, which may make the SMX thread-local data different from SM-core to SM-core. Mapping practices and the best performance for doing this are principally device-specific: the { SMX | SM-registers | GPU_GDDR gloMEM : shaMEM : constMEM | GPU SMX-local cache-hierarchy }-details and capacities
...
+---------------------------------------------------------
... 901 MHz
Cache type: Read/Write
Cache line size: 128
Cache size: 32768
Global memory size: 4294967296
Constant buffer size: 65536
Max number of constant args: 9
Local memory size: 49152
+---------------------------------------------------------
... 4000 MHz
Cache type: Read/Write
Cache line size: 64
Cache size: 262144
Global memory size: 536838144
Constant buffer size: 131072
Max number of constant args: 480
Local memory size: 32768
+---------------------------------------------------------
... 1300 MHz
Cache type: Read/Write
Cache line size: 64
Cache size: 262144
Global memory size: 1561123226
Constant buffer size: 65536
Max number of constant args: 8
Local memory size: 65536
+---------------------------------------------------------
... 4000 MHz
Cache type: Read/Write
Cache line size: 64
Cache size: 262144
Global memory size: 2147352576
Constant buffer size: 131072
Max number of constant args: 480
Local memory size: 32768
are principally so different from device to device that each high-performance code project can do no better than profile its respective task-"geometry" and resources-usage-map composition on the actual deployment device. What works faster on one GPU-device / GPU-driver stack need not work as well on another ( or after a GPU-driver + exo-programming-ecosystem update / upgrade ); only a real-life benchmark will tell ( theory is easily printed, but hardly as easily executed, as many device-specific and workload-injected limitations apply in real-life deployment ).
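A sketch of reading a few of these capacities at run time ( so that a task-"geometry" or a __local-memory budget can be adapted to the actual deployment device before profiling ); the helper name is illustrative and a valid cl_device_id is assumed:

#include <stdio.h>
#include <CL/cl.h>

/* Illustrative helper: read, at run time, the same figures clinfo shows. */
static void print_memory_limits( cl_device_id device )
{
    cl_ulong local_mem = 0, const_buf = 0;
    cl_uint  cache_line = 0;

    clGetDeviceInfo( device, CL_DEVICE_LOCAL_MEM_SIZE,
                     sizeof( local_mem ), &local_mem, NULL );
    clGetDeviceInfo( device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                     sizeof( const_buf ), &const_buf, NULL );
    clGetDeviceInfo( device, CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE,
                     sizeof( cache_line ), &cache_line, NULL );

    printf( "Local memory size:      %llu B\n", (unsigned long long) local_mem );
    printf( "Constant buffer size:   %llu B\n", (unsigned long long) const_buf );
    printf( "Global mem cache line:  %u B\n",   (unsigned) cache_line );
}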
Upvotes: 1