Conversation

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
ckpt_id = "black-forest-labs/FLUX.1-dev"
# ...
# --- Text encoding (CPU) ---
prompt = "photograph of an electronics chip in the shape of a race car with trillium written on its side"
```
nit: again, for clarity I would avoid the "Trillium" word if we test on v5.
This is probably fine. It's quite separate.
```python
xs.mark_sharding(param, mesh, tuple(spec))
# ...
flux_pipe.transformer.enable_xla_flash_attention(partition_spec=("data", None, None, None), is_flux=True)
FlashAttention.DEFAULT_BLOCK_SIZES = {
```
this looks like black magic, consider adding a comment explaining where these come from
Cc: @entrpn for those as it's copied from flux_inference.py.
If I remember correctly, these block sizes have been optimized for Trillium through some tests we ran internally. They can be kept as is as long the v5e's vmem can handle it. These could be optimized in the future specifically for v5e.
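Following the review suggestion to document where these numbers come from, a commented override could look like the sketch below. The key names and tile values here are illustrative placeholders, not the ones actually shipped in flux_inference.py; only the power-of-two sanity check reflects a general TPU kernel expectation.

```python
# Illustrative only: the keys and tile sizes below are placeholders, not
# the values from flux_inference.py. Per the discussion above, the real
# dict was tuned for Trillium and is reused as long as v5e VMEM fits it.
TRILLIUM_TUNED_BLOCK_SIZES = {
    # The Pallas flash-attention kernel tiles the attention matmuls;
    # each entry is a tile edge length along one grid dimension.
    "block_q": 1024,
    "block_k_major": 512,
    "block_k": 512,
    "block_b": 2,
}

# The kind of sanity check a reviewer might want next to the override:
# TPU kernel tiles are generally expected to be powers of two.
for name, size in TRILLIUM_TUNED_BLOCK_SIZES.items():
    assert size > 0 and (size & (size - 1)) == 0, f"{name} must be a power of two"
```

If the sizes are ever retuned for v5e, only the dict values change; the check stays valid.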
```python
def _vae_decode(latents, vae, height, width, device):
    """Move VAE to XLA, decode latents, move VAE back to CPU."""
    vae.to(device)
```
I do not know much about this, but isn't moving the VAE back and forth between the XLA device and CPU quite expensive? Wouldn't it be better to just keep it on the XLA device?
It would barely fit otherwise. Plus we have to free some memory anyway to do the actual computation. Once it's compiled, it doesn't take much of a hit beyond some transfer overhead, which is likely justifiable given the cheap pricing of v5es. Does it make sense?
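The offloading pattern being discussed can be sketched device-agnostically as below. The function name and the `try/finally` are my additions, not the PR's code; the point is that the VAE borrows the accelerator only for the decode call and always hands the memory back.

```python
def offloaded_decode(latents, vae, device):
    """Borrow the accelerator for the decode only (sketch, not the PR's code).

    The VAE stays on CPU between calls so the transformer keeps the device
    memory; once the decode graph is compiled, the per-call cost is mostly
    the weight transfer, not recompilation.
    """
    vae.to(device)
    try:
        return vae.decode(latents)
    finally:
        vae.to("cpu")  # give the memory back even if decode raises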
```
2026-04-15 02:56:13 [info ] avg. inference over 2 iterations took 98.75175104649975 sec.
```

The first inference iteration includes VAE compilation (~195s). The second iteration shows the true steady-state speed (~1.76s).
Perhaps you can include a dummy inference in the compilation part, so that VAE is compiled and timings look more regular.
I didn't get this part. Elaborate more? The block under `logger.info("starting compilation run...")` has the VAE compilation included.
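For what it's worth, the reviewer's suggestion amounts to the standard warm-up pattern: run one untimed iteration so graph compilation (including the VAE decode) falls outside the measured window. A minimal sketch, assuming a hypothetical `run_step` callable that performs one full inference:

```python
import time

def average_inference_time(run_step, warmup_iters=1, timed_iters=2):
    # Untimed warm-up: XLA traces and compiles its graphs here, so the
    # timed loop below only measures steady-state iterations.
    for _ in range(warmup_iters):
        run_step()
    total = 0.0
    for _ in range(timed_iters):
        start = time.perf_counter()
        run_step()
        total += time.perf_counter() - start
    return total / timed_iters
```

With this shape, the reported average would be close to the ~1.76s steady-state number instead of being inflated by the ~195s compilation iteration.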
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Additionally, @entrpn I am seeing recompilations with the following. Is that expected?
@tengomucho a gentle ping
What does this PR do?
Add an example of model parallelism for Flux using PyTorch XLA. Tested on v5e-8.
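The parallelism in the example hinges on attaching a partition spec to each parameter via `xs.mark_sharding(param, mesh, tuple(spec))`, as quoted in the review above. A simplified, hypothetical heuristic for building such specs is sketched below; the actual example may apply different per-layer rules, and `infer_partition_spec` is not a function from the PR.

```python
def infer_partition_spec(shape, axis_name="data"):
    """Shard the largest dimension over the mesh axis, replicate the rest.

    Returns a tuple usable as a partition spec: for a (4096, 1024) weight
    it yields ("data", None). This is a simplification of whatever
    per-parameter rules the example actually applies.
    """
    if not shape:
        return ()  # scalars are fully replicated
    largest = max(range(len(shape)), key=lambda i: shape[i])
    return tuple(axis_name if i == largest else None for i in range(len(shape)))
```

The resulting tuple is exactly the shape of argument `xs.mark_sharding` expects as its third parameter in torch_xla's SPMD API.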
Cc: @entrpn if you could review.