Ami

Reputation: 11

SYCL NDRange and Hierarchical: Why is one of them not enough?

SYCL offers NDRange and Hierarchical kernel parallelism abstractions. My questions:

  1. Is it true that NDRange maps better onto GPU hardware and hierarchical parallelism maps better onto CPU hardware?
  2. Therefore, is it realistic to expect that NDRange will achieve better performance than hierarchical parallelism on GPUs, and that the opposite will occur on CPUs?

Upvotes: 1

Views: 158

Answers (2)

Patric

Reputation: 2131

Simply put, a GPU has multiple levels of computation resources (work-item, sub-group, work-group), so a more expressive representation such as NDRange is needed to address them (see the sketch below).

I talk about some of these concepts in the video below: https://www.youtube.com/watch?v=7HqbuMBUV7A&list=LL
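For illustration, here is a minimal sketch of how an NDRange kernel exposes all three levels. It assumes a SYCL 2020 implementation; the sizes and the empty kernel body are made up purely for illustration:

    #include <sycl/sycl.hpp>

    int main() {
        sycl::queue q;
        // 256 work items, grouped into work groups of 64 (illustrative sizes).
        q.parallel_for(
            sycl::nd_range<1>{sycl::range<1>{256}, sycl::range<1>{64}},
            [=](sycl::nd_item<1> it) {
                sycl::group<1> wg = it.get_group();      // work-group level
                sycl::sub_group sg = it.get_sub_group(); // sub-group level
                size_t gid = it.get_global_id(0);        // work-item level
                (void)wg; (void)sg; (void)gid;           // silence unused warnings
            }).wait();
    }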

Upvotes: 0

illuhad

Reputation: 591

Is it true that NDRange maps better onto GPU hardware and hierarchical parallelism maps better onto CPU hardware?

This is roughly true, yes. You can find a discussion on the different kinds of parallelism in SYCL in one of our papers: https://ieeexplore.ieee.org/abstract/document/9654235

Some key points:

  • NDRange parallelism maps 1:1 to common GPU backends such as CUDA or OpenCL, since they all have a single-program-multiple-data (SPMD) execution model in which the work items are additionally grouped together in work groups. In these models, explicit barriers can occur within one work group (see the first sketch after this list).
  • On CPUs, implementing such an SPMD model efficiently requires non-trivial compiler transformations. This is primarily due to the existence of barriers in work groups, which means that the compiler needs to guarantee that all work items in a group can actually reach the barrier and not block each other. A detailed discussion can be found in this paper of ours: https://dl.acm.org/doi/10.1145/3585341.3585342 (Note: this is not a SYCL problem per se; it is exactly the same e.g. in OpenCL CPU implementations, so there is by now quite some experience with how to do this.)
  • On GPUs, it is hierarchical parallelism that requires non-trivial compiler transformations, in order to achieve the semantics that in the outer scope (inside parallel_for_work_group, but outside parallel_for_work_item) only a single thread should be active. On a GPU, all the other "threads" are physically there anyway, so their presence needs to be masked away somehow in the outer scope (see the second sketch after this list).
  • Performance of hierarchical parallelism on GPU also depends on memory placement. SYCL says that all variables declared in the outer scope live in local memory by default. This can have a substantial negative performance impact if variables are not actually used to exchange data between work items, and therefore would not actually need to live in local memory: not only because local memory is slower than registers, but also because the compiler may need to insert additional barriers to guard local memory accesses. So performance can depend on how well the compiler can determine whether a variable must be placed in local memory or can instead live in registers.
  • On CPU, hierarchical parallelism has the advantage that it allows the programmer to express uniformity: variables declared in the outer work-group scope are known to be uniform across all iterations of the parallel_for_work_item loop. Since parallel_for_work_item is typically implemented as a vectorized loop, this can simplify uniformity analysis for the compiler and thus aid vectorization. In the ndrange model, the compiler has to figure out on its own which variables are uniform across work items and which are not.
  • Hierarchical parallelism on CPU can be implemented efficiently with just regular C++, without any special SYCL compiler features. If you have a SYCL implementation that is "library-only", i.e. implemented as "just a C++ library" for third-party CPU compilers, hierarchical can be orders of magnitude faster than ndrange. This implementation model is a bit exotic nowadays though, as neither of the two major SYCL implementations, AdaptiveCpp and DPC++, is library-only by default.
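To make the SPMD/barrier point concrete, here is a minimal NDRange sketch (the first sketch referenced above). It assumes a SYCL 2020 implementation; the buffer name, the sizes and the computation itself are made up for illustration. Every work item executes the same kernel code, and the group barrier requires all work items of a group to arrive before any of them may continue:

    #include <sycl/sycl.hpp>

    int main() {
        sycl::queue q;
        constexpr size_t N = 1024, WG = 128;
        int *data = sycl::malloc_shared<int>(N, q);

        q.submit([&](sycl::handler &cgh) {
            // One local-memory scratch buffer per work group.
            sycl::local_accessor<int, 1> scratch{sycl::range<1>{WG}, cgh};

            cgh.parallel_for(
                sycl::nd_range<1>{sycl::range<1>{N}, sycl::range<1>{WG}},
                [=](sycl::nd_item<1> it) {
                    size_t lid = it.get_local_id(0);
                    scratch[lid] = static_cast<int>(it.get_global_id(0));

                    // All work items of the group must reach this barrier.
                    // On CPU, the compiler has to transform the kernel so
                    // that the work items do not block each other here.
                    sycl::group_barrier(it.get_group());

                    // Safely read a neighbour's value written above.
                    data[it.get_global_id(0)] = scratch[(lid + 1) % WG];
                });
        }).wait();

        sycl::free(data, q);
    }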
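The same computation written with hierarchical parallelism (the second sketch referenced above, under the same assumptions) looks as follows. Note how the scratch array is simply declared in the outer work-group scope, where it lives in local memory by default, and how the barrier between the two inner scopes is implicit:

    #include <sycl/sycl.hpp>

    int main() {
        sycl::queue q;
        constexpr size_t N = 1024, WG = 128;
        int *data = sycl::malloc_shared<int>(N, q);

        q.submit([&](sycl::handler &cgh) {
            cgh.parallel_for_work_group(
                sycl::range<1>{N / WG}, sycl::range<1>{WG},
                [=](sycl::group<1> g) {
                    // Outer scope: logically a single thread per work group.
                    // On GPU, the other hardware threads must be masked away
                    // here. Variables declared here are placed in local
                    // memory by default and are uniform for the inner loops.
                    int scratch[WG];

                    g.parallel_for_work_item([&](sycl::h_item<1> it) {
                        scratch[it.get_local_id(0)] =
                            static_cast<int>(it.get_global_id(0));
                    });
                    // Implicit barrier between consecutive
                    // parallel_for_work_item scopes.
                    g.parallel_for_work_item([&](sycl::h_item<1> it) {
                        size_t lid = it.get_local_id(0);
                        data[it.get_global_id(0)] = scratch[(lid + 1) % WG];
                    });
                });
        }).wait();

        sycl::free(data, q);
    }

Whether scratch really needs to live in local memory here is exactly the kind of analysis discussed above; a variable declared in the outer scope that is never used to exchange data between work items could stay in a register if the compiler can prove that.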

Therefore, is it realistic to expect that NDRange will achieve better performance than hierarchical parallelism on GPUs, and that the opposite will occur on CPUs?

As a rule of thumb this is probably often true, but my advice would be to run some tests with your SYCL stack if possible. The impact in practice will depend highly on your SYCL implementation and the exact implementation model used. For example, ndrange parallelism on CPU with a library-only SYCL implementation may be completely non-viable, but with a compiler-driven implementation ndrange may perform very well too.

On the other hand, to my knowledge, implementations have spent comparatively little effort on optimizing the hierarchical parallelism model, since there is really not much code out there using it in practice.

To be honest, with modern AdaptiveCpp or DPC++ versions I would not recommend picking hierarchical over ndrange parallelism due to performance concerns on CPU. A lot of work has gone into optimizing ndrange on both CPU and GPU, and my experience is that it is very competitive with hierarchical on CPU. See also this investigation, which shows AdaptiveCpp ndrange outperforming hierarchical on CPU: https://dl.acm.org/doi/10.1145/3648115.3648130 (Slides: https://www.iwocl.org/wp-content/uploads/7601_Marcel-Breyer-University_of_Stuttgart.pdf)

The design and introduction of hierarchical parallelism predates my involvement with SYCL, but my understanding is that hierarchical parallelism was introduced because it is more convenient to formulate certain problems in it, and because a hierarchical model also maps well to our mental model of the execution hierarchy of modern hardware (e.g. chip->core->SIMD unit->vector lane).

However, implementing it on GPU is extremely complex, and we found some design issues while implementing certain parts of hierarchical parallelism. This is why the SYCL specification currently explicitly discourages its use: the feature probably cannot stay the way it is specified now.

Consequently, it has an uncertain future, and I would not recommend it for code investments, except perhaps for experiments.

Upvotes: 1
