
fix npu compatibility #13465

Open

HsiaWinter wants to merge 3 commits into huggingface:main from HsiaWinter:add_npu_compatibility

Conversation

@HsiaWinter
Contributor

What does this PR do?

Fix attention_mask broadcasting for NPU compatibility

@github-actions github-actions bot added models size/S PR with diff < 50 LOC labels Apr 14, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@yiyixuxu yiyixuxu left a comment


Thanks for the PR, I left one question.


```python
if attention_mask.ndim == 4:
    # NPU does not support automatic broadcasting for this type; the mask must be expanded.
    if attention_mask.device.type == 'npu' and attention_mask.shape[1:3] == (1, 1):
```
Collaborator

@yiyixuxu yiyixuxu Apr 14, 2026


Can we verify that if we explicitly set the backend to npu, this would also work?

`def _native_npu_attention(`


When a mask of shape [batch, seq_len] or [batch, 1, 1, seq_len] is passed, the operator fails with an error similar to "get unsupported atten_mask shape, the shape is [B, 1, 1, S]"; only shapes like [B, N, S, S], [B, 1, S, S], [1, 1, S, S], or [S, S] are accepted.

The _native_npu_attention function operates correctly as it leverages _maybe_modify_attn_mask_npu to reshape the attention mask from [batch_size, seq_len_k] to [batch_size, 1, seq_len_q, seq_len_k]. This reshaped format is compatible with the NPU backend.

Reference:
Ascend NPU fusion attention API:
https://www.hiascend.com/document/detail/zh/Pytorch/730/apiref/torchnpuCustomsapi/docs/context/torch_npu-npu_fusion_attention.md
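For illustration, the reshape described above can be sketched in NumPy (the helper name `expand_mask_for_npu` and the concrete shapes are assumptions for this sketch; the actual diffusers code operates on torch tensors on an NPU device):

```python
import numpy as np

def expand_mask_for_npu(mask, seq_len_q):
    # Hypothetical NumPy sketch of the shape logic only; the PR itself works
    # on torch tensors. The NPU fusion attention kernel reportedly rejects
    # [B, 1, 1, S] masks, so the singleton query dimension is broadcast up to
    # the full query length, producing the accepted [B, 1, S_q, S_k] layout.
    if mask.ndim == 4 and mask.shape[1:3] == (1, 1):
        b, _, _, s_k = mask.shape
        mask = np.broadcast_to(mask, (b, 1, seq_len_q, s_k)).copy()
    return mask

padding_mask = np.zeros((2, 1, 1, 8))  # [batch, 1, 1, seq_len_k]
expanded = expand_mask_for_npu(padding_mask, seq_len_q=8)
print(expanded.shape)  # (2, 1, 8, 8)
```

Masks already in an accepted layout (e.g. [B, N, S, S]) pass through unchanged, which mirrors the `shape[1:3] == (1, 1)` guard in the diff above.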

Collaborator


> When a mask of shape [batch, seq_len] or [batch, 1, 1, seq_len] is passed, the operator fails with an error

Just want to make sure we're on the same page, could you share a code example that would produce this error on NPU? Specifically, I'd like to know whether you are running the default attention backend, i.e. without wrapping your model call inside `with attention_backend("_native_npu")`.


@chang-zhijie chang-zhijie Apr 15, 2026


Yes, this code fixes the issue with the "native" backend. After the fix, it runs correctly with the "_native_attention" backend. Here's an example:

```python
import torch
import torch_npu
from diffusers import ErnieImagePipeline
from diffusers.utils import load_image

pipe = ErnieImagePipeline.from_pretrained("/model_dir/ERNIE-Image", torch_dtype=torch.bfloat16)
pipe = pipe.to("npu")
generator = torch.Generator(device="npu")

prompt = "A black and white Chinese rural dog"
images = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=5.0,
    generator=generator,
    use_pe=True,
).images
images[0].save("ernie-image-output.png")
```

However, I've found that when using `_native_npu_attention` as the backend, there are still some issues with mask handling. I've pushed an additional commit to the previous PR: #13451.
To enable the `_native_npu` backend, add the following line to the example: `pipe.transformer.set_attention_backend("_native_npu")`
Note: when processing masks, we need to perform expansion validation when 4D masks are passed in, and apply mask inversion to meet the NPU interface requirements.
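The expand-and-invert step mentioned above can be sketched like this (`to_npu_mask` is a hypothetical helper, not diffusers code; the "True means masked out" convention for the NPU kernel is an assumption based on this thread, and the real code would use torch rather than NumPy):

```python
import numpy as np

def to_npu_mask(keep_mask, seq_len_q):
    # Sketch: convert a boolean "keep" mask of shape [B, S_k]
    # (True = attend, the SDPA convention) into the layout the NPU
    # kernel reportedly expects: [B, 1, S_q, S_k] with True = masked out.
    b, s_k = keep_mask.shape
    inverted = ~keep_mask  # invert: True now means "mask this position out"
    return np.broadcast_to(inverted[:, None, None, :], (b, 1, seq_len_q, s_k)).copy()

keep = np.array([[True, True, False]])  # last token is padding
npu_mask = to_npu_mask(keep, seq_len_q=2)
print(npu_mask.shape)     # (1, 1, 2, 3)
print(npu_mask[0, 0, 0])  # True only for the padded position
```

Validating the incoming mask's ndim before expanding (as the commit reportedly does for 4D inputs) would slot in before the broadcast.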

@yiyixuxu
Collaborator

Thank you both for your inputs.

I looked into this a bit more. I wonder if we can build the mask directly in 2D instead of expanding it to 4D? https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_ernie_image.py#L398-L400

This way, I think it would work out of the box with all of our attention backends that support masks. For the npu device, it would work with the "_native_npu" backend but not the default naive backend - but that's the case with all other models currently.

@chang-zhijie, can you help confirm whether a 2D mask would work with the "_native_npu" backend?

If that's the case, it would be our preferred direction, but if the Baidu team prefers an implementation that works out of the box with the default backend on npu too, we're happy to support that as well. Let us know @HsiaWinter
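As a sketch of that direction (`build_2d_padding_mask` is a hypothetical helper, not code from the linked file), the padding mask could be produced directly as [batch, seq_len] booleans and left for each attention backend to broadcast as it needs:

```python
import numpy as np

def build_2d_padding_mask(lengths, max_len):
    # Hypothetical sketch: build the mask as [batch, seq_len] booleans
    # (True = valid token) instead of pre-expanding to 4D, so backends
    # that accept masks can broadcast or reshape it themselves.
    idx = np.arange(max_len)[None, :]            # [1, seq_len]
    return idx < np.asarray(lengths)[:, None]    # [batch, seq_len]

print(build_2d_padding_mask([3, 1], max_len=4).astype(int))
# [[1 1 1 0]
#  [1 0 0 0]]
```

Keeping the mask 2D defers the layout decision to the backend, which is why it would compose with any backend that supports masks rather than only with ones that accept a pre-expanded 4D shape.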
