
Add Gemma 4 text-decoder export to CoreML #19253

Open
john-rocky wants to merge 1 commit into pytorch:main from john-rocky:coreml/gemma4-text-decoder

Conversation

@john-rocky

Summary

The Gemma 4 text decoder that ships in examples/models/gemma4/text_decoder/
already implements hybrid sliding/full attention, partial RoPE,
per-layer head_dim (256 for sliding layers, 512 for full layers), MQA,
and YOCO KV sharing in plain PyTorch.

I verified that this implementation already lowers cleanly through
torch.export and CoreMLPartitioner today: for the synthetic
10-layer Gemma 4 used in the new test, the lowered edge program
contains only executorch_call_delegate and getitem at the top
level (1,186 MIL ops, fully delegated), with no portable fallbacks and no
unsupported ops.
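The hybrid layout above can be sketched in plain Python. This is an illustrative convention only: the `layer_layout` helper, the `(i + 1) % pattern` rule, and the default pattern of 5 are assumptions for the sketch, not code taken from the Gemma 4 sources.

```python
def layer_layout(num_layers, sliding_window_pattern=5):
    """Illustrative hybrid sliding/full layout (assumed convention):
    every Nth layer uses full attention with head_dim=512, the rest
    use sliding-window attention with head_dim=256."""
    layout = []
    for i in range(num_layers):
        if (i + 1) % sliding_window_pattern == 0:
            layout.append(("full", 512))
        else:
            layout.append(("sliding", 256))
    return layout

# The 10-layer synthetic config repeats "4 sliding + 1 full" twice:
print(layer_layout(10))
```

Under this convention, layers 4 and 9 (0-indexed) come out as full-attention layers, matching the "4 sliding + 1 full × 2" pattern the test builds.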

So the missing piece is not new modeling code — it is the small amount
of glue that turns "exportable in principle" into "exportable from one
shell command". This PR adds that glue:

  • examples/apple/coreml/gemma4/export_gemma4_text_decoder_coreml.py,
    with sensible CoreML defaults: iOS18+ deployment target so the
    YOCO KV caches can be taken over as stateful tensors,
    compute_unit=CPU_AND_NE, fp16 by default (the ANE requires fp16).
  • A --random_weights mode for smoke-testing the export pipeline
    without a HuggingFace checkpoint, plus --config_json,
    --sliding_window, --sliding_window_pattern overrides.
  • A readme.md documenting the flags and the "everything delegates"
    property.
  • A BUCK target so the script is buildable in fbcode the same way
    the existing CoreML llama scripts are.
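For context on what the --sliding_window override controls, here is a minimal sketch of a causal sliding-window attention mask in plain Python. The `sliding_causal_mask` helper is hypothetical and not part of the PR; it just illustrates the masking rule sliding layers follow.

```python
def sliding_causal_mask(seq_len, window):
    # True where query position q may attend key position k:
    # causal (k <= q) and within the sliding window (q - k < window).
    return [[k <= q and q - k < window for k in range(seq_len)]
            for q in range(seq_len)]

mask = sliding_causal_mask(5, 3)
# Query 4 can see keys 2..4 but not keys 0..1 (outside the window).
```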

The audio and vision encoders are intentionally out of scope — the
existing ATen pipeline in examples/models/gemma4 is more appropriate
for those.

Test plan

examples/apple/coreml/gemma4/test.py builds a 10-layer synthetic
Gemma 4 (4 sliding + 1 full × 2) — same hybrid pattern as Gemma 4 E2B,
just at smaller dimensions — and runs the full export pipeline,
asserting the resulting .pte is non-empty.

$ python -m pytest examples/apple/coreml/gemma4/test.py -v
test.py::TestGemma4CoreMLExport::test_eager_forward_runs PASSED
test.py::TestGemma4CoreMLExport::test_full_export_pipeline_lowers_to_coreml PASSED
============================== 2 passed in 15.32s ==============================

I also ran the export by hand and confirmed the resulting edge program
is fully delegated.
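The "everything delegates" property could be expressed as a small predicate like the following. This is illustrative only: the real check walks the exported edge program's graph nodes, and `is_fully_delegated` is a hypothetical helper, not an ExecuTorch API.

```python
def is_fully_delegated(top_level_ops):
    # Fully delegated means the only ops left at the top level of the
    # edge program are the delegate call and the getitem nodes that
    # unpack its outputs.
    allowed = {"executorch_call_delegate", "getitem"}
    return bool(top_level_ops) and all(op in allowed for op in top_level_ops)

assert is_fully_delegated(["executorch_call_delegate", "getitem", "getitem"])
assert not is_fully_delegated(["executorch_call_delegate", "aten.add.Tensor"])
```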

Authored with Claude.
@john-rocky john-rocky requested a review from metascroy as a code owner May 1, 2026 06:03
@pytorch-bot

pytorch-bot Bot commented May 1, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19253

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 1, 2026
@github-actions

github-actions Bot commented May 1, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.
