ggml-cpu: extend RVV quantization vec dot to higher VLENs#10

Open
taimur-10x wants to merge 75 commits into master from 10x/riscv-quant-vec-dot-vlens
Conversation


taimur-10x commented Feb 13, 2026

Summary

This PR adds RVV implementations of the quantized vector dot kernels for 512-bit and 1024-bit VLENs.

Key Changes

  • Added the following RVV kernels:

| Kernel | VLEN |
| --- | --- |
| ggml_vec_dot_q3_K_q8_K | 512, 1024 |
| ggml_vec_dot_q6_K_q8_K | 512, 1024 |
| ggml_vec_dot_iq1_s_q8_K | 512, 1024 |
| ggml_vec_dot_iq1_m_q8_K | 512, 1024 |
| ggml_vec_dot_iq2_s_q8_K | 512, 1024 and above |
| ggml_vec_dot_iq2_xs_q8_K | 512 and above |
| ggml_vec_dot_iq2_xxs_q8_K | 512 and above |
| ggml_vec_dot_iq3_s_q8_K | 512 and above |
| ggml_vec_dot_iq3_xxs_q8_K | 512, 1024 and above |
| ggml_vec_dot_iq4_xs_q8_K | 512, 1024 |
| ggml_vec_dot_tq1_0_q8_K | 512 and above |
  • Each VLEN-dependent kernel now has its own separate function, for example ggml_vec_dot_tq1_0_q8_K_vl512.

Testing

Kernels were functionally tested with test-quantize-fns under QEMU for 512-bit and 1024-bit VLENs.
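For reproducing such a test run, QEMU's RISC-V CPU model exposes `v`/`vlen`/`elen` properties that let the emulated VLEN be varied. The binary path and exact flag spellings below are assumptions about a typical setup, not taken from this PR:

```shell
# Hypothetical QEMU invocation for running the quantization tests at VLEN=512.
vlen=512
cmd="qemu-riscv64 -cpu rv64,v=true,vlen=${vlen},elen=64 ./bin/test-quantize-fns"
echo "$cmd"
# Re-run with vlen=1024 to exercise the 1024-bit code paths.
```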

@taimur-10x taimur-10x marked this pull request as draft February 13, 2026 15:52
@github-actions github-actions Bot added the ggml label Feb 13, 2026
@taimur-10x taimur-10x changed the base branch from master to 10x/riscv-quant-vec-dot-128b March 4, 2026 11:48
@taimur-10x taimur-10x force-pushed the 10x/riscv-quant-vec-dot-vlens branch from 86ffc7e to bd69a20 on March 4, 2026 17:10
@taimur-10x taimur-10x marked this pull request as ready for review March 4, 2026 17:11
@rehan-10xengineer rehan-10xengineer force-pushed the 10x/riscv-quant-vec-dot-128b branch 2 times, most recently from f83ddf7 to c7c6abc on March 16, 2026 10:55
@taimur-10x taimur-10x force-pushed the 10x/riscv-quant-vec-dot-128b branch 3 times, most recently from cf95828 to 05a5425 on March 18, 2026 12:47
@rehan-10xengineer rehan-10xengineer force-pushed the 10x/riscv-quant-vec-dot-128b branch from 05a5425 to 80c0ac3 on April 14, 2026 11:37
JohannesGaessler and others added 17 commits April 19, 2026 18:26
* CUDA: refactor mma data loading for AMD

* fix CDNA MMQ occupancy

* fix CDNA3 mma

* fix RDNA3 compile
* [SYCL] Fix reorder MMVQ assert on unaligned vocab sizes

The reorder mul_mat_vec_q dispatchers for Q4_0, Q8_0, Q4_K, and Q6_K
asserted that block_num_y was a multiple of 16 subgroups. Models with
a vocab size not divisible by 16 (for example HY-MT at 120818) aborted
on model load when the output projection tripped the assert.

I replaced the assert with padding: block_num_y now rounds up to a
whole number of subgroup-sized workgroups. The kernel already has the
row bounds check (`if (row >= nrows) return;`) so the extra padded
threads early-exit cleanly. Row values are uniform across a subgroup
so the collective reduce stays safe.

For aligned vocab sizes the padded block_num_y equals the old value,
so the kernel launch is identical and there is no regression.

Thanks to @arthw for flagging the relationship to ggml-org#21527.

Fixes ggml-org#22020.

AI assisted coding, tested on Intel B70 hardware.

* sycl: use WARP_SIZE for num_subgroups in reorder MMVQ launches

Replaces the hardcoded 16 with WARP_SIZE in the four reorder_mul_mat_vec
launch helpers (Q4_0, Q8_0, Q4_K, Q6_K). Compile-time no-op on the Intel
target where WARP_SIZE is 16, but makes the relationship to subgroup
size explicit. Per review by @NeoZhangJianyu on ggml-org#22035.

Assisted by Claude.
…22102)

* llama: fix crash in print_info for GLM-DSA when vocab_only is set

* addressed code review comments

* cont : simplify

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* merged properly, but slow q3_k and q5_k with u32 indexing

* Start on new mat-vec

* New format float paths working

* Working q4_0

* Work on remaining legacy q-types

* port k-quants to new matvec

* remove old shader

* Remove old constants, format

* remove accidental file

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
…g#21636)

* Implemented optimized q1_0 dot for x86 and generic

* Removed redundant helper definition

* Removed two redundant instructions from AVX q1_0 dot

* Fixed inconsistency with fp16 conversion for generic q1_0 dot and deduplicated generic fallback

* Style cleanup around AVX q1_0 dot

* Replaced explicitly unrolled blocks with inner for loop for q1_0

* Replaced scalar ARM q1_0 impl with new generic one
* TP: fix 0-sized tensor slices, AllReduce fallback

* fix layer structure <-> GPU count aliasing

* add missing std::fill

* fix CUDA device set, max ggml ctx size
* Fix delayed AllReduce on Gemma-4 MoE

Skip forward past nodes that don't consume the current one, and allow a chain of MULs.

* Check for all sources before skipping nodes

* Address review comments
* server : remove /api endpoints

* cont : remove /api/tags
* ggml-cuda: flush legacy pool on OOM and retry

Signed-off-by: 梁厚宏 <2695316095@qq.com>

* Address review comments: add explicit sync, update destructor, clean up MUSA macros

Signed-off-by: 梁厚宏 <2695316095@qq.com>

---------

Signed-off-by: 梁厚宏 <2695316095@qq.com>
…ice (ggml-org#22171)

* fit-params : add option to output estimated memory per device

* cont : minor

* cont : refactor

* cont : move fit params implementation to libcommon

* cont : header

* cont : headers

* cont : codeowners