[RFC] Add RFC for SYCL & ESIMD GPU radix sort#2643
Conversation
- Please note this draft was planned and executed by an AI agent with my guidance and review Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
This reverts commit 5c4f59a.
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
There was a problem hiding this comment.
Pull request overview
Adds an RFC documenting the design, safety model, and expected performance characteristics of the experimental oneDPL kernel-template GPU radix sort implementations (ESIMD for PVC and a forward-progress-based pure SYCL variant).
Changes:
- Documented the exposed kernel-template radix sort APIs (keys-only and key-value variants) and tuning parameters.
- Described the onesweep + decoupled-lookback algorithm, including forward-progress safety considerations for SYCL.
- Added platform support notes, testing matrix, and open questions / exit criteria.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
rfcs/experimental/kernel_templates/radix_sort_esimd_sycl/README.md
Outdated
Show resolved
Hide resolved
rfcs/experimental/kernel_templates/radix_sort_esimd_sycl/README.md
Outdated
Show resolved
Hide resolved
rfcs/experimental/kernel_templates/radix_sort_esimd_sycl/README.md
Outdated
Show resolved
Hide resolved
rfcs/experimental/kernel_templates/radix_sort_esimd_sycl/README.md
Outdated
Show resolved
Hide resolved
rfcs/experimental/kernel_templates/radix_sort_esimd_sycl/README.md
Outdated
Show resolved
Hide resolved
|
|
||
| ### Platform Support | ||
|
|
||
| The ESIMD implementation is limited to Intel PVC architecture with a work-group size of 64 and data per work item |
There was a problem hiding this comment.
The ESIMD work-group size constraint is documented as fixed at 64, but the implementation allows 32 or 64 (see __check_esimd_sort_params in include/oneapi/dpl/experimental/kt/internal/esimd_radix_sort_utils.h). Please align the RFC with the actual constraint (or clarify if 32 is unsupported for the onesweep path).
| The ESIMD implementation is limited to Intel PVC architecture with a work-group size of 64 and data per work item | |
| The ESIMD implementation is limited to Intel PVC architecture with a work-group size of 32 or 64 and data per work item |
There was a problem hiding this comment.
@mmichel11 is and data per work item here stating that the implementation is limited to 64 data elements per work item? It reads to me like a dangling phrase and I would suggest repeating the size limitation if that is correct.
There was a problem hiding this comment.
The full sentence wraps around to the next line and is and data per work item multiple of 32. meaning only work group size of 64 is supported and data per work item must be a multiple of 32. I can rephrase if it is still confusing.
@dmitriy-sobolev Could you let me know what the actual limitation for ESIMD work group size is? I see our implementation enables 32 and 64 but our current documentation states that only 64 is supported and that is the only test scenario I see.
There was a problem hiding this comment.
64 is the actual limitation. There were issues with 32-bit in some rare cases, that's why it was not stated as supported.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
| The returned event or underlying queue must be waited on to guarantee algorithm completion. | ||
|
|
||
| Unified dispatch logic in `radix_sort_dispatchers.h` and `radix_sort_submitters.h` selects the appropriate | ||
| implementation and optimization path based on the input size and available hardware features. |
There was a problem hiding this comment.
It would be good to clarify that you mean that these serve both ESIMD and sycl variants.
I'm not sure if you want to get into discussing the choice of implementation / optimization based on input size here. If so, you may need to go into a bit more detail about what you mean, or link to the section which does.
| | Work-group size | 64 | 512, 1024 | | ||
| | Data per work-item | Multiples of 32 (limited by SLM capacity) | Multiples of 1 (limited by SLM capacity) | | ||
| | Radix bits | 8 | 8 | | ||
|
|
There was a problem hiding this comment.
It might be useful to briefly describe or link to how ESIMD vs sycl workgroup size and data per work-item equivalencies work to understand how these compare.
| #### Onesweep Radix Sort Overview | ||
|
|
||
| The onesweep radix sort algorithm [[3]](#references) processes keys in a single pass per radix stage, unlike | ||
| traditional radix sort implementations that use an additional input pass per radix stage. For an 8-bit radix, a 32-bit |
There was a problem hiding this comment.
It may be good to define what a radix stage is.
| 7. Atomic fetch add global histogram with SLM histogram for all stages and bins | ||
| ``` | ||
|
|
||
| **Scan kernel** — Converts per-bin counts into global incoming offsets for each bin across each stage. |
There was a problem hiding this comment.
There should be a mention of the grainsize of these global counts. If I remember correctly, there is a copy of each bin count for each "workgroup block" of input keys, meaning the data handled by a single workgroup.
|
|
||
| ```mermaid | ||
| graph LR | ||
| A[User API] --> B[Dispatcher<br/>radix_sort_dispatchers.h] |
There was a problem hiding this comment.
The user api should be split here into sycl api and esimd api
| **SYCL Implementation:** | ||
| The SYCL implementation provides a onesweep implementation. The implementation supports keys-only sorting, key-value | ||
| pair sorting, and out-of-place sorting variants. The one-work-group optimization remains an open topic for future work. | ||
|
|
There was a problem hiding this comment.
I think we should mention here that this was a design decision to decline to dispatch to one-work-group optimization for sycl radix sort, and that we may add an explicit one-work-group kernel template in the future. We did not update esimd to be consistent, since we see the sycl path to be the forward-looking implementation of choice, and what we want to spend our time on.
| - It relies on a hardware scheduler requirement that a work-group will not be indefinitely preempted by higher indexed | ||
| work-groups once it begins executing | ||
|
|
||
| ## Open Questions |
There was a problem hiding this comment.
The way a lot of these are phrase or organized, I look at them more as future work than an open question.
For some, I think they can be reframed to be an open question. "should we explore doing this?" etc.
It may just be that we want to rephrase these, or name the section differently. You could also split this into a future work section for things we have more or less decided are a good idea to spend effort on, and frame the rest as open questions.
| The exit criteria for this feature align with the [kernel templates exit | ||
| criteria](../README.md#exit-criteria) in addition to addressing the above open questions. |
There was a problem hiding this comment.
As the previous section are more "future work" items, I think the exit criteria involves defining the scope of a "fully supported" feature on these dimensions, and then implementing to that specification. I dont think all of the above items need to be implemented / supported to exit experimental.
I agree that we need more concrete acceptance of KT in general as well.
ESIMD and SYCL based radix sort kernel templates have already been implemented as experimental features. This RFC serves to document the benefits the algorithms deliver, the high-level pseudocode of their implementations, reasoning for decisions made along the way, and expected future work.