Reputation: 11
SYCL offers NDRange and Hierarchical kernel parallelism abstractions. My questions:

1. Is it true to claim that NDRange maps better onto GPU hardware and Hierarchical parallelism maps better onto CPU hardware?
2. Therefore, is it a realistic expectation that NDRange will achieve better performance on GPUs than Hierarchical parallelism, and that on CPUs the opposite will occur?
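For context, here is a minimal sketch (untested; assumes SYCL 2020 with USM support, sizes are arbitrary) of what the two styles look like for the same element-wise kernel:

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  float *data = sycl::malloc_shared<float>(1024, q);
  for (int i = 0; i < 1024; ++i) data[i] = float(i);

  // NDRange style: one flat index space, the work-group size given explicitly.
  q.submit([&](sycl::handler &cgh) {
    cgh.parallel_for(sycl::nd_range<1>{sycl::range<1>{1024}, sycl::range<1>{64}},
                     [=](sycl::nd_item<1> it) {
                       data[it.get_global_id(0)] += 1.0f;
                     });
  }).wait();

  // Hierarchical style: an explicit work-group scope containing a work-item scope.
  q.submit([&](sycl::handler &cgh) {
    cgh.parallel_for_work_group(sycl::range<1>{16}, sycl::range<1>{64},
                                [=](sycl::group<1> g) {
                                  g.parallel_for_work_item([&](sycl::h_item<1> it) {
                                    data[it.get_global_id(0)] += 1.0f;
                                  });
                                });
  }).wait();

  sycl::free(data, q);
}
```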
Upvotes: 1
Views: 158
Reputation: 2131
To put it simply, a GPU has multiple levels of computation resources (work-item, sub-group, work-group), so a more detailed representation such as NDRange is needed to express them.
I cover some of these concepts in the video below: https://www.youtube.com/watch?v=7HqbuMBUV7A&list=LL
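For illustration, a rough sketch (untested; the function name and sizes are made up, and n is assumed to be a multiple of the work-group size) of how an NDRange kernel can query all three levels through its nd_item:

```cpp
#include <sycl/sycl.hpp>

// Each work-item can ask the nd_item where it sits in the
// work-item / sub-group / work-group hierarchy.
void levels_demo(sycl::queue &q, float *out, size_t n) {
  q.parallel_for(sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{64}},
                 [=](sycl::nd_item<1> it) {
                   size_t global = it.get_global_id(0); // work-item in the global range
                   size_t local  = it.get_local_id(0);  // work-item within its work-group
                   auto sg       = it.get_sub_group();  // sub-group (SIMD-like slice)
                   size_t group  = it.get_group(0);     // work-group index
                   // Purely illustrative: record which group/sub-group lane this item is.
                   out[global] = float(group * 1000 + sg.get_local_id()[0] * 10 + local % 10);
                 });
}
```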
Upvotes: 0
Reputation: 591
Is it true to claim that NDRange maps better onto GPU hardware and Hierarchical parallelism maps better onto CPU hardware?
This is roughly true, yes. You can find a discussion on the different kinds of parallelism in SYCL in one of our papers: https://ieeexplore.ieee.org/abstract/document/9654235
Some key points:
- In the hierarchical model, at work-group scope (i.e. inside parallel_for_work_group, but not in parallel_for_work_item) only a single thread should be active. On a GPU, all the other "threads" are there anyway, so their presence needs to be masked away somehow in the outer scope.
- Conversely, anything computed at work-group scope is known to be uniform across the parallel_for_work_item loop (see the sketch after this list). Since parallel_for_work_item is typically implemented as a vectorized loop, this can simplify uniformity analysis for the compiler and thus aid vectorization. In the ndrange model, the compiler will need to figure out by itself which variables are uniform across work items and which are not.
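A minimal sketch of the second point (untested; the function name and sizes are made up):

```cpp
#include <sycl/sycl.hpp>

// `scale` lives at work-group scope, so it is uniform across the
// parallel_for_work_item loop by construction. On a CPU the inner loop is
// typically vectorized; on a GPU the outer scope must behave as if only a
// single thread per work-group were active, so the extra threads have to be
// masked away there.
void hierarchical_uniformity(sycl::queue &q, float *data) {
  q.submit([&](sycl::handler &cgh) {
    cgh.parallel_for_work_group(sycl::range<1>{16}, sycl::range<1>{64},
                                [=](sycl::group<1> g) {
      // Work-group scope: conceptually executed by one thread per work-group.
      float scale = 2.0f + float(g.get_group_id(0)); // uniform for the whole group
      // Work-item scope: a loop over the work-items, amenable to vectorization.
      g.parallel_for_work_item([&](sycl::h_item<1> it) {
        data[it.get_global_id(0)] *= scale;
      });
    });
  });
}
```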
Therefore, is it a realistic expectation that NDRange will achieve better performance on GPUs than Hierarchical parallelism, and that on CPUs the opposite will occur?
As a rule of thumb this is probably often true, but my advice would be to run some tests with your SYCL stack if possible. This is because the impact in practice will depend highly on your SYCL implementation and the exact implementation model used. For example, ndrange parallelism on a CPU with a library-only SYCL implementation may be completely non-viable, but with a compiler-driven implementation ndrange may perform very well there too.
On the other hand, to my knowledge, implementations have spent far less effort on optimizing the hierarchical parallelism model than on ndrange, since there is really not much code out there using it in practice.
To be honest, with modern AdaptiveCpp or DPC++ versions I would not recommend picking hierarchical over ndrange parallelism due to performance concerns on CPU. A lot of work has gone into optimizing ndrange on both CPU and GPU, and my experience is that it is very competitive with hierarchical on CPU. See also this investigation, which shows AdaptiveCpp's ndrange outperforming hierarchical on CPU: https://dl.acm.org/doi/10.1145/3648115.3648130 (Slides: https://www.iwocl.org/wp-content/uploads/7601_Marcel-Breyer-University_of_Stuttgart.pdf)
The design and introduction of hierarchical parallelism predates my involvement with SYCL, but my understanding is that hierarchical parallelism was introduced because it is more convenient to formulate certain problems in it, and because a hierarchical model also maps well to our mental model of the execution hierarchy of modern hardware (e.g. chip->core->SIMD unit->vector lane).
However, implementing it on GPU is extremely complex, and we found some design issues while implementing certain parts of hierarchical parallelism. This is why the SYCL specification currently explicitly discourages its use: the feature probably cannot remain in the form in which it is specified now.
Consequently, it has an uncertain future, and I would not recommend it for code investments, except perhaps for experiments.
Upvotes: 1