ggml-cpu: extend RVV quantization vec dot to higher VLENs#10

Open
taimur-10x wants to merge 75 commits into master from 10x/riscv-quant-vec-dot-vlens
Conversation


taimur-10x commented Feb 13, 2026

Summary

This PR adds RVV implementations of the quantized vector dot kernels for 512-bit and 1024-bit VLENs.

Key Changes

  • Added the following RVV kernels:

| Kernel | VLEN |
| --- | --- |
| ggml_vec_dot_q3_K_q8_K | 512, 1024 |
| ggml_vec_dot_q6_K_q8_K | 512, 1024 |
| ggml_vec_dot_iq1_s_q8_K | 512, 1024 |
| ggml_vec_dot_iq1_m_q8_K | 512, 1024 |
| ggml_vec_dot_iq2_s_q8_K | 512, 1024 and above |
| ggml_vec_dot_iq2_xs_q8_K | 512 and above |
| ggml_vec_dot_iq2_xxs_q8_K | 512 and above |
| ggml_vec_dot_iq3_s_q8_K | 512 and above |
| ggml_vec_dot_iq3_xxs_q8_K | 512, 1024 and above |
| ggml_vec_dot_iq4_xs_q8_K | 512, 1024 |
| ggml_vec_dot_tq1_0_q8_K | 512 and above |
  • Each VLEN-dependent kernel now has its own separate function, for example ggml_vec_dot_tq1_0_q8_K_vl512.

Testing

Kernels were functionally tested with test-quantize-fns under QEMU for 512-bit and 1024-bit VLENs.
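For reproducing such a test run, QEMU's RISC-V CPU model exposes `v`/`vlen`/`elen` properties that let the emulated VLEN be varied. The binary path and exact flag spellings below are assumptions about a typical setup, not taken from this PR:

```shell
# Hypothetical QEMU invocation for running the quantization tests at VLEN=512.
vlen=512
cmd="qemu-riscv64 -cpu rv64,v=true,vlen=${vlen},elen=64 ./bin/test-quantize-fns"
echo "$cmd"
# Re-run with vlen=1024 to exercise the 1024-bit code paths.
```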

@taimur-10x taimur-10x marked this pull request as draft February 13, 2026 15:52
@github-actions github-actions Bot added the ggml label Feb 13, 2026
@taimur-10x taimur-10x changed the base branch from master to 10x/riscv-quant-vec-dot-128b March 4, 2026 11:48
@taimur-10x taimur-10x force-pushed the 10x/riscv-quant-vec-dot-vlens branch from 86ffc7e to bd69a20 on March 4, 2026 17:10
@taimur-10x taimur-10x marked this pull request as ready for review March 4, 2026 17:11
@rehan-10xengineer rehan-10xengineer force-pushed the 10x/riscv-quant-vec-dot-128b branch 2 times, most recently from f83ddf7 to c7c6abc on March 16, 2026 10:55
@taimur-10x taimur-10x force-pushed the 10x/riscv-quant-vec-dot-128b branch 3 times, most recently from cf95828 to 05a5425 on March 18, 2026 12:47
@rehan-10xengineer rehan-10xengineer force-pushed the 10x/riscv-quant-vec-dot-128b branch from 05a5425 to 80c0ac3 on April 14, 2026 11:37
JohannesGaessler and others added 17 commits April 19, 2026 18:26
* CUDA: refactor mma data loading for AMD

* fix CDNA MMQ occupancy

* fix CDNA3 mma

* fix RDNA3 compile
* [SYCL] Fix reorder MMVQ assert on unaligned vocab sizes

The reorder mul_mat_vec_q dispatchers for Q4_0, Q8_0, Q4_K, and Q6_K
asserted that block_num_y was a multiple of 16 subgroups. Models with
a vocab size not divisible by 16 (for example HY-MT at 120818) aborted
on model load when the output projection tripped the assert.

I replaced the assert with padding: block_num_y now rounds up to a
whole number of subgroup-sized workgroups. The kernel already has the
row bounds check (`if (row >= nrows) return;`) so the extra padded
threads early-exit cleanly. Row values are uniform across a subgroup
so the collective reduce stays safe.

For aligned vocab sizes the padded block_num_y equals the old value,
so the kernel launch is identical and there is no regression.

Thanks to @arthw for flagging the relationship to ggml-org#21527.

Fixes ggml-org#22020.

AI assisted coding, tested on Intel B70 hardware.

* sycl: use WARP_SIZE for num_subgroups in reorder MMVQ launches

Replaces the hardcoded 16 with WARP_SIZE in the four reorder_mul_mat_vec
launch helpers (Q4_0, Q8_0, Q4_K, Q6_K). Compile-time no-op on the Intel
target where WARP_SIZE is 16, but makes the relationship to subgroup
size explicit. Per review by @NeoZhangJianyu on ggml-org#22035.

Assisted by Claude.
…22102)

* llama: fix crash in print_info for GLM-DSA when vocab_only is set

* addressed code review comments

* cont : simplify

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* merged properly, but slow q3_k and q5_k with u32 indexing

* Start on new mat-vec

* New format float paths working

* Working q4_0

* Work on remaining legacy q-types

* port k-quants to new matvec

* remove old shader

* Remove old constants, format

* remove accidental file

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
…g#21636)

* Implemented optimized q1_0 dot for x86 and generic

* Removed redundant helper definition

* Removed two redundant instructions from AVX q1_0 dot

* Fixed inconsistency with fp16 conversion for generic q1_0 dot and deduplicated generic fallback

* Style cleanup around AVX q1_0 dot

* Replaced explicitly unrolled blocks with inner for loop for q1_0

* Replaced scalar ARM q1_0 impl with new generic one
* TP: fix 0-sized tensor slices, AllReduce fallback

* fix layer structure <-> GPU count aliasing

* add missing std::fill

* fix CUDA device set, max ggml ctx size
* Fix delayed AllReduce on Gemma-4 MoE

Skip forward past nodes that don't consume the current one, and allow a chain of MULs.

* Check for all sources before skipping nodes

* Address review comments
* server : remove /api endpoints

* cont : remove /api/tags
* ggml-cuda: flush legacy pool on OOM and retry

Signed-off-by: 梁厚宏 <2695316095@qq.com>

* Address review comments: add explicit sync, update destructor, clean up MUSA macros

Signed-off-by: 梁厚宏 <2695316095@qq.com>

---------

Signed-off-by: 梁厚宏 <2695316095@qq.com>
…ice (ggml-org#22171)

* fit-params : add option to output estimated memory per device

* cont : minor

* cont : refactor

* cont : move fit params implementation to libcommon

* cont : header

* cont : headers

* cont : codeowners