
[RFC] Add RFC for SYCL & ESIMD GPU radix sort#2643

Open
mmichel11 wants to merge 28 commits into main from dev/mmichel11/kt_sort_rfc

Conversation

@mmichel11
Contributor

ESIMD- and SYCL-based radix sort kernel templates have already been implemented as experimental features. This RFC documents the benefits the algorithms deliver, high-level pseudocode of their implementations, the reasoning behind design decisions made along the way, and expected future work.

- Please note this draft was planned and executed by an AI agent with my
  guidance and review

Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Contributor

Copilot AI left a comment


Pull request overview

Adds an RFC documenting the design, safety model, and expected performance characteristics of the experimental oneDPL kernel-template GPU radix sort implementations (ESIMD for PVC and a forward-progress-based pure SYCL variant).

Changes:

  • Documented the exposed kernel-template radix sort APIs (keys-only and key-value variants) and tuning parameters.
  • Described the onesweep + decoupled-lookback algorithm, including forward-progress safety considerations for SYCL.
  • Added platform support notes, testing matrix, and open questions / exit criteria.



### Platform Support

The ESIMD implementation is limited to Intel PVC architecture with a work-group size of 64 and data per work item

Copilot AI Mar 23, 2026


The ESIMD work-group size constraint is documented as fixed at 64, but the implementation allows 32 or 64 (see __check_esimd_sort_params in include/oneapi/dpl/experimental/kt/internal/esimd_radix_sort_utils.h). Please align the RFC with the actual constraint (or clarify if 32 is unsupported for the onesweep path).

Suggested change
The ESIMD implementation is limited to Intel PVC architecture with a work-group size of 64 and data per work item
The ESIMD implementation is limited to Intel PVC architecture with a work-group size of 32 or 64 and data per work item

Contributor


@mmichel11 is "and data per work item" here stating that the implementation is limited to 64 data elements per work item? It reads to me like a dangling phrase, and I would suggest repeating the size limitation if that is correct.

Contributor Author

@mmichel11 Mar 27, 2026


The full sentence wraps around to the next line and reads "and data per work item multiple of 32", meaning only a work-group size of 64 is supported and data per work item must be a multiple of 32. I can rephrase if it is still confusing.

@dmitriy-sobolev Could you let me know what the actual limitation for ESIMD work group size is? I see our implementation enables 32 and 64 but our current documentation states that only 64 is supported and that is the only test scenario I see.

Contributor


64 is the actual limitation. There were issues with a work-group size of 32 in some rare cases; that's why it was not stated as supported.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
The returned event or underlying queue must be waited on to guarantee algorithm completion.

Unified dispatch logic in `radix_sort_dispatchers.h` and `radix_sort_submitters.h` selects the appropriate
implementation and optimization path based on the input size and available hardware features.
Contributor


It would be good to clarify that these serve both the ESIMD and SYCL variants.

I'm not sure if you want to get into discussing the choice of implementation / optimization based on input size here. If so, you may need to go into a bit more detail about what you mean, or link to the section that does.

| Work-group size | 64 | 512, 1024 |
| Data per work-item | Multiples of 32 (limited by SLM capacity) | Multiples of 1 (limited by SLM capacity) |
| Radix bits | 8 | 8 |

Contributor


It might be useful to briefly describe or link to how ESIMD vs. SYCL work-group size and data per work-item equivalencies work, to understand how these compare.

#### Onesweep Radix Sort Overview

The onesweep radix sort algorithm [[3]](#references) processes keys in a single pass per radix stage, unlike
traditional radix sort implementations that use an additional input pass per radix stage. For an 8-bit radix, a 32-bit
Contributor


It may be good to define what a radix stage is.

7. Atomic fetch add global histogram with SLM histogram for all stages and bins
```

**Scan kernel** — Converts per-bin counts into global incoming offsets for each bin across each stage.
Contributor


There should be a mention of the grain size of these global counts. If I remember correctly, there is a copy of each bin count for each "work-group block" of input keys, meaning the data handled by a single work-group.


```mermaid
graph LR
A[User API] --> B[Dispatcher<br/>radix_sort_dispatchers.h]
Contributor


The user API should be split here into the SYCL API and the ESIMD API.

**SYCL Implementation:**
The SYCL variant implements onesweep and supports keys-only sorting, key-value pair sorting, and out-of-place sorting. The one-work-group optimization remains an open topic for future work.

Contributor


I think we should mention here that it was a design decision to decline to dispatch to the one-work-group optimization for SYCL radix sort, and that we may add an explicit one-work-group kernel template in the future. We did not update ESIMD to be consistent, since we see the SYCL path as the forward-looking implementation of choice and where we want to spend our time.

- It relies on the hardware scheduler guaranteeing that a work-group will not be indefinitely preempted by higher-indexed
work-groups once it begins executing

## Open Questions
Contributor


The way a lot of these are phrased or organized, I see them more as future work than open questions.

For some, I think they can be reframed as an open question: "Should we explore doing this?" etc.

It may just be that we want to rephrase these or name the section differently. You could also split this into a future-work section for things we have more or less decided are worth spending effort on, and frame the rest as open questions.

Comment on lines +382 to +383
The exit criteria for this feature align with the [kernel templates exit
criteria](../README.md#exit-criteria) in addition to addressing the above open questions.
Contributor


As the previous section's items are more "future work", I think the exit criteria involve defining the scope of a "fully supported" feature along these dimensions and then implementing to that specification. I don't think all of the above items need to be implemented / supported to exit experimental.

I agree that we need more concrete acceptance of KT in general as well.
