
[RFC] Add RFC for SYCL & ESIMD GPU radix sort#2643

Open
mmichel11 wants to merge 28 commits into main from dev/mmichel11/kt_sort_rfc

Conversation

@mmichel11
Contributor

ESIMD- and SYCL-based radix sort kernel templates have already been implemented as experimental features. This RFC documents the benefits the algorithms deliver, high-level pseudocode of their implementations, the reasoning behind design decisions made along the way, and expected future work.

- Please note this draft was planned and executed by an AI agent with my
  guidance and review

Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Contributor

Copilot AI left a comment


Pull request overview

Adds an RFC documenting the design, safety model, and expected performance characteristics of the experimental oneDPL kernel-template GPU radix sort implementations (ESIMD for PVC and a forward-progress-based pure SYCL variant).

Changes:

  • Documented the exposed kernel-template radix sort APIs (keys-only and key-value variants) and tuning parameters.
  • Described the onesweep + decoupled-lookback algorithm, including forward-progress safety considerations for SYCL.
  • Added platform support notes, testing matrix, and open questions / exit criteria.



### Platform Support

The ESIMD implementation is limited to Intel PVC architecture with a work-group size of 64 and data per work item

Copilot AI Mar 23, 2026


The ESIMD work-group size constraint is documented as fixed at 64, but the implementation allows 32 or 64 (see __check_esimd_sort_params in include/oneapi/dpl/experimental/kt/internal/esimd_radix_sort_utils.h). Please align the RFC with the actual constraint (or clarify if 32 is unsupported for the onesweep path).

Suggested change
The ESIMD implementation is limited to Intel PVC architecture with a work-group size of 64 and data per work item
The ESIMD implementation is limited to Intel PVC architecture with a work-group size of 32 or 64 and data per work item

Contributor


@mmichel11 is "and data per work item" here stating that the implementation is limited to 64 data elements per work item? It reads to me like a dangling phrase, and I would suggest repeating the size limitation if that is correct.

Contributor Author

@mmichel11 Mar 27, 2026


The full sentence wraps around to the next line and reads "and data per work item multiple of 32", meaning only a work-group size of 64 is supported and data per work item must be a multiple of 32. I can rephrase if it is still confusing.

@dmitriy-sobolev Could you let me know what the actual limitation for ESIMD work group size is? I see our implementation enables 32 and 64 but our current documentation states that only 64 is supported and that is the only test scenario I see.

Contributor


64 is the actual limitation. There were issues with a work-group size of 32 in some rare cases; that's why it was not stated as supported.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
The returned event or underlying queue must be waited on to guarantee algorithm completion.

Unified dispatch logic in `radix_sort_dispatchers.h` and `radix_sort_submitters.h` selects the appropriate
implementation and optimization path based on the input size and available hardware features.
Contributor


It would be good to clarify that these serve both the ESIMD and SYCL variants.

I'm not sure if you want to get into discussing the choice of implementation / optimization based on input size here. If so, you may need to go into a bit more detail about what you mean, or link to the section that does.

| Work-group size | 64 | 512, 1024 |
| Data per work-item | Multiples of 32 (limited by SLM capacity) | Multiples of 1 (limited by SLM capacity) |
| Radix bits | 8 | 8 |

Contributor


It might be useful to briefly describe or link to how ESIMD vs. SYCL work-group size and data per work-item equivalencies work, to understand how these compare.

#### Onesweep Radix Sort Overview

The onesweep radix sort algorithm [[3]](#references) processes keys in a single pass per radix stage, unlike
traditional radix sort implementations that use an additional input pass per radix stage. For an 8-bit radix, a 32-bit
Contributor


It may be good to define what a radix stage is.

7. Atomic fetch add global histogram with SLM histogram for all stages and bins
```

**Scan kernel** — Converts per-bin counts into global incoming offsets for each bin across each stage.
Contributor


There should be a mention of the grain size of these global counts. If I remember correctly, there is a copy of each bin count for each "work-group block" of input keys, meaning the data handled by a single work-group.


```mermaid
graph LR
A[User API] --> B[Dispatcher<br/>radix_sort_dispatchers.h]
Contributor


The user API should be split here into the SYCL API and the ESIMD API.

**SYCL Implementation:**
The SYCL variant implements onesweep and supports keys-only sorting, key-value pair sorting, and out-of-place sorting. The one-work-group optimization remains an open topic for future work.

Contributor


I think we should mention here that it was a design decision to decline to dispatch to the one-work-group optimization for SYCL radix sort, and that we may add an explicit one-work-group kernel template in the future. We did not update ESIMD to be consistent, since we see the SYCL path as the forward-looking implementation of choice and where we want to spend our time.

- It relies on the hardware scheduler guaranteeing that a work-group will not be indefinitely preempted by higher-indexed
work-groups once it begins executing

## Open Questions
Contributor


The way a lot of these are phrased or organized, I see them more as future work than open questions.

For some, I think they can be reframed as an open question: "Should we explore doing this?" etc.

It may just be that we want to rephrase these or name the section differently. You could also split this into a future-work section for things we have more or less decided are worth spending effort on, and frame the rest as open questions.

Comment on lines +382 to +383
The exit criteria for this feature align with the [kernel templates exit
criteria](../README.md#exit-criteria) in addition to addressing the above open questions.
Contributor


As the previous section's items are more "future work", I think the exit criteria involve defining the scope of a "fully supported" feature along these dimensions and then implementing to that specification. I don't think all of the above items need to be implemented / supported to exit experimental.

I agree that we need more concrete acceptance of KT in general as well.
