ggml-cpu: extend RVV quantization vec dot to higher VLENs #10
Open
taimur-10x wants to merge 75 commits into master from
Conversation
* Force-pushed 86ffc7e to bd69a20
* Force-pushed f83ddf7 to c7c6abc
* Force-pushed cf95828 to 05a5425
* Force-pushed 05a5425 to 80c0ac3
* CUDA: refactor mma data loading for AMD
* fix CDNA MMQ occupancy
* fix CDNA3 mma
* fix RDNA3 compile
* [SYCL] Fix reorder MMVQ assert on unaligned vocab sizes

  The reorder mul_mat_vec_q dispatchers for Q4_0, Q8_0, Q4_K, and Q6_K asserted that block_num_y was a multiple of 16 subgroups. Models with a vocab size not divisible by 16 (for example HY-MT at 120818) aborted on model load when the output projection tripped the assert.

  I replaced the assert with padding: block_num_y now rounds up to a whole number of subgroup-sized workgroups. The kernel already has the row bounds check (`if (row >= nrows) return;`), so the extra padded threads early-exit cleanly. Row values are uniform across a subgroup, so the collective reduce stays safe. For aligned vocab sizes the padded block_num_y equals the old value, so the kernel launch is identical and there is no regression.

  Thanks to @arthw for flagging the relationship to ggml-org#21527. Fixes ggml-org#22020. AI assisted coding, tested on Intel B70 hardware.

* sycl: use WARP_SIZE for num_subgroups in reorder MMVQ launches

  Replaces the hardcoded 16 with WARP_SIZE in the four reorder_mul_mat_vec launch helpers (Q4_0, Q8_0, Q4_K, Q6_K). Compile-time no-op on the Intel target where WARP_SIZE is 16, but makes the relationship to subgroup size explicit. Per review by @NeoZhangJianyu on ggml-org#22035. Assisted by Claude.
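The padding fix described above amounts to rounding the launch size up to a whole number of subgroups instead of asserting divisibility. A minimal sketch of that round-up, assuming a hypothetical `padded_block_num_y` helper and the Intel `WARP_SIZE` of 16 (the names are illustrative, not the exact SYCL code):

```cpp
#include <cassert>

// Subgroup size assumed here; on the Intel target discussed above it is 16.
constexpr int WARP_SIZE = 16;

// Hypothetical helper: round block_num_y up to a whole number of
// subgroup-sized workgroups so no divisibility assert is needed.
// Out-of-range rows early-exit in the kernel via `if (row >= nrows) return;`.
int padded_block_num_y(int block_num_y) {
    return (block_num_y + WARP_SIZE - 1) / WARP_SIZE * WARP_SIZE;
}
```

For vocab sizes already aligned to the subgroup size the padded value equals the input, which is why the aligned launch path is unchanged.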
…22102)

* llama: fix crash in print_info for GLM-DSA when vocab_only is set
* addressed code review comments
* cont : simplify

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* merged properly, but slow q3_k and q5_k with u32 indexing
* Start on new mat-vec
* New format float paths working
* Working q4_0
* Work on remaining legacy q-types
* port k-quants to new matvec
* remove old shader
* Remove old constants, format
* remove accidental file

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
…g#21636)

* Implemented optimized q1_0 dot for x86 and generic
* Removed redundant helper definition
* Removed two redundant instructions from AVX q1_0 dot
* Fixed inconsistency with fp16 conversion for generic q1_0 dot and deduplicated generic fallback
* Style cleanup around AVX q1_0 dot
* Replaced explicitly unrolled blocks with inner for loop for q1_0
* Replaced scalar ARM q1_0 impl with new generic one
* TP: fix 0-sized tensor slices, AllReduce fallback
* fix layer structure <-> GPU count aliasing
* add missing std::fill
* fix CUDA device set, max ggml ctx size
* Fix delayed AllReduce on Gemma-4 MoE

  Skip forward past nodes that don't consume the current one, and allow a chain of MULs.

* Check for all sources before skipping nodes
* Address review comments
* server : remove /api endpoints
* cont : remove /api/tags
* ggml-cuda: flush legacy pool on OOM and retry

  Signed-off-by: 梁厚宏 <2695316095@qq.com>

* Address review comments: add explicit sync, update destructor, clean up MUSA macros

  Signed-off-by: 梁厚宏 <2695316095@qq.com>

---------

Signed-off-by: 梁厚宏 <2695316095@qq.com>
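The flush-and-retry pattern in the commit above can be sketched with a toy allocator; `try_alloc` stands in for the real device allocation (e.g. cudaMalloc) and `flush_pool` for releasing the legacy pool's cached buffers. All names and sizes here are illustrative, not the actual ggml-cuda code:

```cpp
#include <cassert>
#include <cstddef>

// Toy state: bytes cached by the legacy pool, and bytes currently free.
static size_t pool_cached = 256;
static size_t free_mem    = 128;

// Stand-in for the real allocator: succeeds if enough memory is free.
bool try_alloc(size_t n) { return n <= free_mem; }

// Stand-in for flushing the legacy pool: returns cached memory to the system.
void flush_pool() { free_mem += pool_cached; pool_cached = 0; }

// On OOM, flush the pool's cached buffers and retry the allocation once.
bool alloc_with_retry(size_t n) {
    if (try_alloc(n)) return true;
    flush_pool();
    return try_alloc(n);
}
```

The point of the pattern is that an allocation which would fail against fragmented or pool-held memory gets one more chance after the cache is released, rather than aborting immediately.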
…org#18760) (ggml-org#22003)

Fixes: ggml-org#18760

Co-authored-by: Christian <christian@example.com>
…ice (ggml-org#22171)

* fit-params : add option to output estimated memory per device
* cont : minor
* cont : refactor
* cont : move fit params implementation to libcommon
* cont : header
* cont : headers
* cont : codeowners
Summary
This PR adds RVV implementations of the quantized vector dot kernels for 512-bit and 1024-bit VLENs.
Key Changes
* ggml_vec_dot_tq1_0_q8_K_vl512
Testing
Kernels were functionally tested for 512-bit and 1024-bit VLENs with QEMU, using test-quantize-fns.
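For context on what these vec dot kernels compute, here is a scalar reference sketch: quantized tensors are stored in blocks of small integers with a per-block float scale, and the dot product is the scaled sum of integer products. The block size, field names, and q8-style layout below are simplified illustrations, not the exact ggml format of tq1_0 or q8_K; the RVV kernels in this PR vectorize the inner accumulation across wider VLENs.

```cpp
#include <cassert>
#include <cstdint>

constexpr int BLOCK = 32;  // illustrative block size

// Simplified q8-style block: one float scale plus BLOCK int8 values.
struct block_q8 {
    float  d;
    int8_t qs[BLOCK];
};

// Scalar reference: accumulate integer products per block, then apply
// both blocks' scales. Vector kernels must match this result bit-for-bit
// modulo float rounding, which is what test-quantize-fns checks.
float vec_dot_q8_ref(int nblocks, const block_q8 *x, const block_q8 *y) {
    float sum = 0.0f;
    for (int b = 0; b < nblocks; ++b) {
        int32_t acc = 0;
        for (int i = 0; i < BLOCK; ++i) {
            acc += int32_t(x[b].qs[i]) * int32_t(y[b].qs[i]);
        }
        sum += x[b].d * y[b].d * float(acc);
    }
    return sum;
}
```

An RVV implementation replaces the inner loop with vector loads and a widening multiply-accumulate, so a 512-bit or 1024-bit VLEN processes proportionally more int8 lanes per iteration.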