Jaffer Wilson
Jaffer Wilson

Reputation: 7273

OpenCL shows 2 units whereas my GPU device physically has many cores - why?

I am trying to run my program with OpenCL.

I have seen the following information in the log:

OpenCL device #0: GPU NVIDIA Corporation GeForce GT 730 with OpenCL 1.2 (2 units, 901 MHz, 4096 Mb, version 391.35)
OpenCL device #1: GPU NVIDIA Corporation GeForce GT 730 with OpenCL 1.2 (2 units, 901 MHz, 4096 Mb, version 391.35)
OpenCL device #2: CPU Intel(R) Corporation Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz with OpenCL 2.1 (8 units, 4000 MHz, 16300 Mb, version 7.0.0.2567)

What I guess from the above information, is that my GPU device has 2 units each as work item.

After checking the specification of my GPU device using CudaZ utility, I see that I have 384 Cores reported for a GPU device in a [ PCI_LOC=0:1:0 ].

See the image:

my GPU specification

The clinfo show the following: gist of clinfo

My question is that, when I am having 384 cores each, then why there are 2 units displayed? Secondly, when I have many cores, how openCL is distributing the task, is it on each core same process and same data or is it different core with different data?

Upvotes: 0

Views: 858

Answers (1)

user3666197
user3666197

Reputation: 1

My question is that, when I am having 384 cores each, then why there are 2 units displayed ?

Easy: GPU computing devices are different, having other silicon-hardwired architectures, than any universal CPU CISC/RISC computing devices.

The reason why is very important here.

GPU devices use Streaming Multiprocessor eXecution units ( SMX units ), that are referred in some hardware-inspection tools.

While the letter M in the SMX abbreviation emphasises, there are multiple executions loadable onto the SMX-unit, yet, all such cases actually do execute ( sure, only if instructed in such a manner, which goes outside of the scope of this topic, to cover / span all over each of the SMX-present SM-cores ) the very same computing instructions - this is the only way they can operate - it is called a SIMD-type of limited scope of parallelism achievable ( co-locally ) on the perimeter of the SMX only, where single-instruction-multiple-data can become executed within a present SIMD-( WARP-wide | half-WARP-wide )-scheduler capabilities.

Having listed those 384 cores, posted above, means a hardware limit, beyond which this co-locally orchestrated SIMD-type of limited-scope parallelism cannot grow, and all attempts into this direction will lead to a pure-[SERIAL] internal scheduling of GPU-jobs ( yes, i.e. one-after-another ).

Understanding these basics is cardinal, as without these architecture features, one may expect a behaviour, that is actually principally impossible to get orchestrated in any whatever kind of the GPGPU system, having a formal shape of [ 1-CPU-host : N-GPU-device(s) ] compositions of autonomous, asynchronous star-of-nodes.

Any GPU-kernel loaded from a CPU-host onto GPU will get mapped onto a non-empty set of SMX-unit(s), where a specified number of cores ( another, finer grain geometry-of-computing resources is applied, again going way beyond the scope of this post ) gets loaded with a stream of SIMD-instructions, not violating the GPU-device limits:

 ...
+----------------------------------------------------------------------------------------
 Max work items dimensions:          3       // 3D-geometry grids possible
    Max work items[0]:               1024    // 1st dimension max.
    Max work items[1]:               1024
    Max work items[2]:               64      // theoretical max. 1024 x 1024 x 64 BUT...
+----------------------------------------------------------------------------------------
 Max work group size:                1024    // actual      max. "geometry"-size
+----------------------------------------------------------------------------------------
 ...

So,

  • if 1-SM-core was internally instructed to execute some GPU-task unit ( a GPU-job ), just this one SM-core will fetch one GPU-RISC-instruction after another ( ignoring any possible ILP for the simplicity here ) and execute it one at a time, stepping through the stream of SIMD-instructions of the said GPU-job. All the rest of the SM-cores present on the same SMX-unit typically do nothing during that time, until this GPU-job get finished and the internal GPU-process management system decides about mapping some other work for this SMX.

  • if 2-SM-cores were instructed to execute some GPU-job, just this pair of SM-cores will fetch one ( and the very same ) GPU-RISC-instruction after another ( ignoring any possible ILP for the simplicity here ) and both execute it one at a time, stepping through the stream of SIMD-instructions of the said GPU-job. In this case, if one SM-core gets into a condition, where an if-ed, or similarly branched, flow of execution makes one SM-core into going into another code-execution-flow path than the other, the SIMD-parallelism gets into divergent scenario, where one SM-core gets a next SIMD-instruction, belonging to it's code-execution path, whereas the other one does nothing ( gets a GPU_NOP(s) ), until the first one finished the whole job ( or was enforced to stop at some synchronisation barrier of fell into an unmaskable latency wait-state, when waiting for a piece of data to get fetched from "far" ( slow ) non-local memory location, again, details go way beyond the scope of this post ) - only after any one of this happens, the divergent-path, so far just GPU_NOP-ed SM-core can receive any next SIMD-instruction, belonging to its ( divergent ) code-execution-path to move any forward. All the rest of the SM-cores present on the same SMX-unit typically do nothing during that time, until this GPU-job get finished and the internal GPU-process management system decides about mapping some other work for this SMX.

  • if 16-SM-cores were instructed to execute some GPU-job by the task-specific "geometry", just this "herd" of SM-cores will fetch one ( and the very same ) GPU-RISC-instruction after another ( ignoring any possible ILP for the simplicity here ) and all execute it one at a time, stepping through the stream of SIMD-instructions of the said GPU-job. Any divergence inside the "herd" reduce the SIMD-effect and GPU_NOP-blocked cores remain waiting for the main part of the "herd" to finish the job ( same as was sketched right above this point ).

Anyway, all the other SM-cores, not mapped by the task-specific "geometry" on the respective GPU-devices' SMX-unit will typically remain doing nothing useful at all - so the importance of knowing the hardware details for the proper task-specific "geometry" is indeed important and profiling may help to identify the peak performance for any such GPU-task constellation ( differences may range several orders of magnitude - from best to common to worse - among all possible task-specific "geometry" setups ).


Secondly, when I have many cores, how openCL is distributing the task, is it on each core same process and same data or is it different core with different data ?

As explained in brief above - the SIMD-type device silicon-architecture does not permit any of the SMX SM-cores to execute anything other than the very same SIMD-instruction on the whole "herd"-of-SM-cores, that was mapped by a task-"geometry" onto the SMX-unit ( not counting the GPU_NOP(s) as doing " something else " as it is just wasting CPU:GPU-system time ).

So, yes, " .. on each core same process .. " ( best if never divergent in its internal code-execution paths after if or while or any other kind of code-execution path branching ), so if algorithm, based on data-driven values results in different internal state, each core may have different thread-local-state, based on which the processing may differ ( as exemplified with if-driven divergent code-execution paths above ). More details on SM-local registers, SM-local caching, restricted shared-memory usage ( and latency costs ), GPU-device global-memory usage ( and latency costs and cache-line lengths and associativity for best coalescing access-patterns for latency masking options - many hardware-related + programming eco-system details go into small thousands of pages of hardware + software specific documentation and are well beyond the scope of this simplified for clarity post )

same data or is it different core with different data ?

This is the last, but not least, dilemma - any well parameterised GPU-kernel activation may also pass some amount of external-world data down to the GPU-kernel, which may make SMX thread-local data differend from SM-core to SM-core. Mapping practices and best performance for doing this are principally device specific ( { SMX | SM-registers | GPU_GDDR gloMEM : shaMEM : constMEM | GPU SMX-local cache-hierarchy }-details and capacities

  ...
 +---------------------------------------------------------
  ...                                               901 MHz
  Cache type:                            Read/Write
  Cache line size:                     128
  Cache size:                        32768
  Global memory size:           4294967296
  Constant buffer size:              65536
  Max number of constant args:           9
  Local memory size:                 49152
 +---------------------------------------------------------
  ...                                              4000 MHz
  Cache type:                            Read/Write
  Cache line size:                      64
  Cache size:                       262144
  Global memory size:            536838144
  Constant buffer size:             131072
  Max number of constant args:         480
  Local memory size:                 32768
 +---------------------------------------------------------
  ...                                              1300 MHz
  Cache type:                            Read/Write
  Cache line size:                      64
  Cache size:                       262144
  Global memory size:           1561123226
  Constant buffer size:              65536
  Max number of constant args:           8
  Local memory size:                 65536
 +---------------------------------------------------------
  ...                                              4000 MHz
  Cache type:                            Read/Write
  Cache line size:                      64
  Cache size:                       262144
  Global memory size:           2147352576
  Constant buffer size:             131072
  Max number of constant args:         480
  Local memory size:                 32768

are principally so different device to device, that each high-performance code project principally can but profile its respective GPU-device task-"geometry and resources-usage maps composition for actual deployment device. What may work faster on one GPU-device / GPU-drives stack, need not work as smart on another one ( or after GPU-driver + exo-programming ecosystem update / upgrade ), simply only the real-life benchmark will tell ( as theory could be easily printed, but hardly as easily executed, as many device-specific and workload-injected limitations will apply in real-life deployment ).

Upvotes: 1

Related Questions