Conversation
Code Review
This pull request significantly enhances sequence parallelism support by implementing ZigZag Ring Attention for long-sequence training and Ulysses-style sequence parallelism for Qwen3.5 linear attention. It also introduces multimodal deepstack patching for Qwen3-VL and refactors the SequenceParallel strategy to better handle complex device meshes and packed/varlen inputs. Feedback focuses on improving code maintainability and robustness, specifically by grouping attributes in the SequenceParallel constructor, removing redundant logic and unused imports, replacing deprecated inspection methods, and centralizing duplicated loss-gathering logic.
```python
self.seq_world_size = None
self.sp_world_size = None
self.rp_world_size = None
self.dp_world_size = None
self.world_size = None
self.attn_implementation = None
self.model_dtype = None
self.tokenizer = None
self.device_mesh = None
self._sp_group = None
self._rp_group = None
self._data_rank_group = None
self._sp_rank = 0
self._rp_rank = 0
self.num_heads = None
self.causal_mask_func = None
self.extra_kwargs = {}
```
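The review asks for these constructor attributes to be grouped. One way to do that is with small dataclasses; the sketch below is hypothetical (the `MeshState`/`GroupState` names and the split are assumptions, not the project's actual refactor), mirroring the fields above:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class MeshState:
    """Hypothetical grouping of the world-size / device-mesh fields."""
    seq_world_size: Optional[int] = None
    sp_world_size: Optional[int] = None
    rp_world_size: Optional[int] = None
    dp_world_size: Optional[int] = None
    world_size: Optional[int] = None
    device_mesh: Any = None


@dataclass
class GroupState:
    """Hypothetical grouping of process-group handles and local ranks."""
    sp_group: Any = None
    rp_group: Any = None
    data_rank_group: Any = None
    sp_rank: int = 0
    rp_rank: int = 0


class SequenceParallel:
    def __init__(self):
        # Two small state objects replace a long run of loose attributes.
        self.mesh = MeshState()
        self.groups = GroupState()
        self.attn_implementation = None
        self.model_dtype = None
        self.tokenizer = None
        self.num_heads = None
        self.causal_mask_func = None
        self.extra_kwargs: dict = {}
```

Grouping this way keeps related state together and makes it cheap to pass the mesh or group bundle to helpers without threading a dozen attributes through call sites.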
```python
if query.shape[2] != total_tokens:
    raise ValueError('Packed/varlen flash_attention_2 expects query sequence length to match '
                     f'cu_seqlens total tokens, got query_seq_len={query.shape[2]} '
                     f'and cu_seqlens_total={total_tokens}.')
```
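For context on what this check compares: in packed/varlen attention, `cu_seqlens` holds cumulative sequence boundaries, so its last entry is the total token count that the flattened query's sequence axis must match. A standalone illustration (toy values, not the project's code):

```python
# cu_seqlens for three packed sequences of lengths 3, 4, and 5.
cu_seqlens = [0, 3, 7, 12]

# The total token count across all packed sequences is the last boundary.
total_tokens = cu_seqlens[-1]

# A packed query tensor must carry exactly total_tokens positions
# along its sequence axis; anything else means the packing metadata
# and the tensor have diverged.
query_seq_len = 12
if query_seq_len != total_tokens:
    raise ValueError(f'query_seq_len={query_seq_len} does not match '
                     f'cu_seqlens_total={total_tokens}')
print(total_tokens)  # 12
```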
```python
if self.rp_world_size > 1:
    attn_impl = getattr(model.config, '_attn_implementation', None)
    if attn_impl != 'flash_attention_2':
        raise NotImplementedError('Derived ring attention only supports flash_attention_2 backend.')
```
```python
import inspect
import os
from functools import cache


@cache
def _get_default_args(func):
    spec = inspect.getfullargspec(func)
    defaults = spec.defaults if spec.defaults is not None else ()
    padded_defaults = (None, ) * (len(spec.args) - len(defaults)) + defaults
    args = dict(zip(spec.args, padded_defaults))
    if 'softcap' in args:
        args['softcap'] = 0.0
    return args
```
```python
if self.sp_strategy is not None:
    loss_inputs, loss_outputs = self.sp_strategy.gather_loss_tensors(inputs, outputs)
```
```python
# Local labels still count only the shard-local tokens. Normalize the loss
# contribution here so metric-side averaging matches the non-SP path.
if ulysses_size > 1:
    loss = loss / float(ulysses_size)
```
Why is the division placed here? Put differently, does the loss used for the model's backward pass need to be divided by ulysses_size?
When `loss_instance` uses `reduction='sum'`, the loss here is the full-sequence loss replicated on every ulysses rank, but the `num_tokens` counted here is still each rank's local shard token count. The two denominators are inconsistent, so the loss is divided once by ulysses_size; this division only fixes the metric-reporting side. The loss used for backpropagation is not divided by ulysses_size: `GatherLoss.apply` keeps only the local gradient.
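The mismatch described in that reply can be checked with plain numbers: with `reduction='sum'`, every ulysses rank holds an identical copy of the full-sequence loss, while each rank's token count covers only its shard. A standalone arithmetic sketch (illustrative values, no distributed setup):

```python
ulysses_size = 4
total_tokens = 1024
full_seq_loss_sum = 2048.0   # replicated identically on every rank

# Each rank counts only the tokens in its local shard.
local_num_tokens = total_tokens // ulysses_size  # 256

# Naive metric aggregation sums per-rank (loss, tokens) pairs, so the
# replicated loss is counted ulysses_size times while tokens are not.
naive_avg = (full_seq_loss_sum * ulysses_size) / (local_num_tokens * ulysses_size)

# Dividing each rank's reported loss by ulysses_size restores the match.
fixed_avg = (full_seq_loss_sum / ulysses_size * ulysses_size) / (
    local_num_tokens * ulysses_size)

true_avg = full_seq_loss_sum / total_tokens
print(naive_avg, fixed_avg, true_avg)  # naive is ulysses_size times too large
```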
```python
from twinkle.utils.grad_clip import normalize_and_clip_grad_norm


def _get_raw_dp_fsdp_world_size(device_mesh: Optional[DeviceMesh]) -> int:
    ...
```
Isn't this the same as the device mesh's dp_world_size? Can that be reused?
Not the same. This computes `dp_world_size * fsdp_world_size`, whereas the device mesh's `dp_world_size` is:

```python
@property
def dp_world_size(self) -> int:
    return self._get_world_size_for_dim('dp')
```
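The distinction in that reply can be made concrete with a toy mesh-shape dict. This is a standalone sketch, not the project's `DeviceMesh` API; the dict layout and helper names are assumptions for illustration:

```python
# Hypothetical 16-GPU layout: 2 data-parallel groups, each FSDP-sharded
# over 4 devices, each replica sequence-parallel over 2 devices.
mesh_shape = {'dp': 2, 'fsdp': 4, 'sp': 2}


def dp_world_size(shape):
    """Mirrors the dp-only property: size of the 'dp' dim alone."""
    return shape.get('dp', 1)


def dp_fsdp_world_size(shape):
    """The helper under discussion: dp and fsdp dims combined, i.e.
    the number of ranks that together hold one copy of the data."""
    return shape.get('dp', 1) * shape.get('fsdp', 1)


print(dp_world_size(mesh_shape))       # 2
print(dp_fsdp_world_size(mesh_shape))  # 8
```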
`dp_world_size` and `data_rank` exist specifically to identify the data group. Even if a combined world size is genuinely needed, the implementation should live in DeviceMesh for reuse rather than in transformers.py; coupling the model code with data-group logic creates maintenance problems.
```python
result = loss_instance(inputs, outputs, **kwargs)
loss_inputs = inputs
loss_outputs = outputs
if self.sp_strategy is not None:
    ...
```
Could this part use the InputProcessor? Since the splitting is done by the InputProcessor, shouldn't the gathering be placed there as well?
That doesn't seem appropriate: by this point we are already in the loss-computation stage, and the InputProcessor's responsibility should be processing inputs.
The name InputProcessor may just be poorly chosen; the component exists to do task-specific data processing. If this logic stays in the model code and another subclass is added, the implementation has to be rewritten all over again.
- Refactor linear attention sequence parallel import error message into a constant
- Fix token counting in TransformersModel by using raw DP/FSDP world size instead of data_world_size
- Enhance Framework.gather_object to check distributed initialization before accessing world size
- Add test utility for creating padded labels in sequence parallel tests
- Add `num_tokens` field to `ModelOutput` TypedDict for explicit token denominator
- Update `LossOutput` to use `OutputType` for `num_tokens` instead of `int`
- Refactor `LossMetric` to prefer `num_tokens` from outputs, with fallback to labels
- Remove `_get_raw_dp_fsdp_world_size` helper and use `_device_mesh._get_dp_fsdp_world_size`
- Use `InputProcessor.postprocess_tensor_sp` for loss tensor gathering in TransformersModel
- Simplify sequence-parallel loss normalization by relying on output `num_tokens`
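The `LossMetric` change above ("prefer `num_tokens` from outputs, with fallback to labels") can be sketched as a small helper. The function name is hypothetical, not the project's API; the `-100` ignore index follows the usual transformers label-padding convention:

```python
IGNORE_INDEX = -100  # conventional label-padding value in transformers


def resolve_num_tokens(outputs: dict, labels: list) -> int:
    """Prefer an explicit num_tokens carried in the model outputs;
    otherwise fall back to counting non-ignored label positions."""
    num_tokens = outputs.get('num_tokens')
    if num_tokens is not None:
        return int(num_tokens)
    return sum(1 for t in labels if t != IGNORE_INDEX)


print(resolve_num_tokens({'num_tokens': 7}, []))      # 7
print(resolve_num_tokens({}, [-100, 5, 9, -100, 3]))  # 3
```

Carrying the denominator in the outputs is what lets the sequence-parallel path report a consistent token count even when the local labels cover only a shard.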
Force-pushed 0e8600d to f13bc37 (Compare)




PR type
PR information
This PR adds context parallel and Qwen3.5 Gated DeltaNet sequence parallel support to the transformers stack, and refactors sequence parallel into a package-based implementation.
Main changes:
- Refactor `sequence_parallel.py` into a `sequence_parallel/` package and add shared utilities.
- Add `linear_attention_sp.py`; ring attention is not supported for this path yet.
- Add tests covering `sp_fsdp_dense` and `tests/moe/test_expert_parallel_qwen3_fsdp_sp.py`.

Experiment results