Modify remote_function decorators in multi_lora_transformers#173

Merged
tastelikefeet merged 1 commit into modelscope:main from xichengpro:main
Apr 21, 2026
Conversation

@xichengpro
Contributor

Updated remote_function decorators to specify collection methods.

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

When I use self-host mode for LoRA SFT training, the following error occurs during the eval phase while executing this code:

    for batch in dataloader:
        model.forward_only(inputs=batch)
        model.calculate_loss()

Traceback (most recent call last):
  File "/data/dubingnan/dbn-ceph/exp/coder/taas/sft_rslora_lf_aligned_lr.py", line 457, in <module>
    train()
  File "/data/dubingnan/dbn-ceph/exp/coder/taas/sft_rslora_lf_aligned_lr.py", line 437, in train
    eval_metrics = evaluate(model, eval_dataloader, global_step)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/dubingnan/dbn-ceph/exp/coder/taas/sft_rslora_lf_aligned_lr.py", line 329, in evaluate
    model.calculate_loss()
  File "/data/dubingnan/dbn-ceph/twinkle/src/twinkle_client/model/multi_lora_transformers.py", line 76, in calculate_loss
    response = http_post(
               ^^^^^^^^^^
  File "/data/dubingnan/dbn-ceph/twinkle/src/twinkle_client/http/http_utils.py", line 157, in http_post
    return _handle_response(response)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/dubingnan/dbn-ceph/twinkle/src/twinkle_client/http/http_utils.py", line 85, in _handle_response
    raise requests.HTTPError(http_error_msg, response=response)
requests.exceptions.HTTPError: 500 Error for url: http://10.178.165.81:8000/api/v1/model/Qwen/Qwen2.5-Coder-7B-Instruct/twinkle/calculate_loss
Server detail:
Internal Server Error

I found that the calculate_loss method in MultiLoraTransformersModel does not preserve the base class's distributed semantics, which leads to incorrect loss calculations under multi-GPU data-parallel (DP) training. This PR updates the remote_function decorators to specify explicit collection methods.
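To illustrate the idea behind the fix, here is a minimal sketch of what a `remote_function` decorator with an explicit collection strategy might look like. This is a hypothetical toy implementation, not the actual twinkle API: the `collect` parameter, the rank simulation, and the function names below are assumptions made for illustration, mirroring the 'mean' strategy for loss aggregation and the 'first' strategy for state retrieval described in this PR.

```python
# Hypothetical sketch of a remote_function decorator with a `collect`
# strategy; names and signatures are illustrative, not the twinkle API.
from functools import wraps


def remote_function(collect="all"):
    """Run the wrapped function once per simulated rank, then reduce."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(per_rank_inputs):
            # A real framework would dispatch to remote DP workers; here we
            # simply call the function once per simulated rank's input.
            results = [fn(x) for x in per_rank_inputs]
            if collect == "mean":
                # Aggregate losses across DP ranks by averaging.
                return sum(results) / len(results)
            if collect == "first":
                # Result is identical on all ranks (e.g. a state dict),
                # so fetching it from rank 0 alone is sufficient.
                return results[0]
            return results  # default: return the per-rank results as-is
        return wrapper
    return decorator


@remote_function(collect="mean")
def calculate_loss(per_rank_loss):
    return per_rank_loss


@remote_function(collect="first")
def get_state_dict(rank_state):
    return rank_state


print(calculate_loss([1.0, 3.0]))                      # 2.0
print(get_state_dict(["rank0_state", "rank1_state"]))  # rank0_state
```

Without an explicit strategy, a decorator like this would return the raw per-rank list, and a caller expecting a single scalar loss would compute the wrong value under multi-GPU DP, which matches the failure mode described above.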


Updated remote_function decorators to specify collection methods.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the MultiLoraTransformers class by adding collection strategies to remote function decorators. Specifically, it configures calculate_loss to use a 'mean' collection strategy for aggregating losses across ranks and get_state_dict to use a 'first' collection strategy for efficient state retrieval. I have no feedback to provide as the review comments were explanatory in nature.

@tastelikefeet tastelikefeet merged commit 95ec7d8 into modelscope:main Apr 21, 2026
1 of 3 checks passed
