feat: add `inference.empty_cache_per_design` flag to reduce CUDA allocator fragmentation (#451)
Conversation
This failure is pre-existing and unrelated to this PR. From the CI metadata: the ~6 minutes of runtime was entirely environment setup (virtualenv creation plus package installation). The actual test crashed instantly at module import, before any test logic ran: `import dgl` fails because `torchdata.datapipes` was removed in torchdata >= 0.7.0, and the DGL version in the benchmark environment pulls in graphbolt, which requires it. This is a dependency-pin issue. The CI report itself attributes the breakage to commit №20 by Junior Martins, not to this PR. The second traceback (`shutil.move`: `FileNotFoundError: tests/outputs`) is a downstream symptom: the import crash meant no test outputs were ever produced for the move step. Recommended fix for CI: pin `dgl < 2.0` or `torchdata < 0.7.0` in the ubuntu-20.04.clang.python39.rfd test environment to restore the `torchdata.datapipes` API. This PR's changes do not touch the dependency stack.
@lyskov Could you please review this?
The CI failures are pre-existing and unrelated to this PR.

Root cause: `RuntimeError: Numpy is not available` in `rfdiffusion/igso3.py:93` during `Diffuser.__init__()`. NumPy's C extension (`_ARRAY_API`) fails to load in the CI environment (Python 3.9 venv) — this is visible as a `UserWarning: Failed to initialize NumPy: _ARRAY_API not found` at the very start of the log, before any test logic runs. The crash propagates up through `IGSO3.__init__()` → `Diffuser.__init__()` → `SelfConditioning.initialize()` → `sampler_selector()`, which is why line 54 of `run_inference.py` appears as the reported crash site — it is simply the outermost frame.

This PR's changes are post-loop cleanup at lines 191–201 of `run_inference.py`, which are only reached after a design completes successfully; they are unreachable from this failure. Running the same test suite against upstream main without these changes produces the identical error.
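The "outermost frame" point can be illustrated with a minimal, self-contained sketch (the function names here are placeholders mirroring the call chain above, not the real code):

```python
import traceback

def diffuser_init():
    # Stand-in for IGSO3.__init__ raising when NumPy's C extension fails.
    raise RuntimeError("Numpy is not available")

def sampler_selector():
    diffuser_init()

def run_inference():
    # Outermost frame, analogous to run_inference.py line 54.
    sampler_selector()

try:
    run_inference()
except RuntimeError as exc:
    frames = [f.name for f in traceback.extract_tb(exc.__traceback__)]

# The traceback lists the outermost frame first and the actual raise site
# last, so a report quoting an early frame points at run_inference even
# though the error originated deep inside diffuser_init.
```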
Problem
When running RFdiffusion with variable-length contigs (e.g. `contigmap.contigs=[A1-469/0 1-50]`) over hundreds or thousands of designs, per-worker VRAM grows steadily from ~7 GB to 10–13 GB per process. This limits how many workers can run in parallel on a single GPU before exhausting VRAM.

Root cause: PyTorch's CUDA caching allocator accumulates fragmented memory blocks across designs. With variable-length contigs each design allocates differently-sized tensors; freed blocks are cached but cannot be reused for different-sized allocations, causing steady VRAM growth.
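The growth pattern can be sketched with a toy model of a size-bucketed caching allocator (pure Python, no GPU needed; all names here are illustrative and much simpler than PyTorch's actual allocator):

```python
import random

class ToyCachingAllocator:
    """Toy stand-in for a caching allocator: freed blocks are kept in a
    per-size free list and can only satisfy a request of the exact same
    size, so mismatched sizes force the pool to keep growing."""

    def __init__(self):
        self.free_blocks = {}  # block size -> number of cached free blocks
        self.reserved = 0      # total memory ever claimed from the device

    def alloc(self, size):
        if self.free_blocks.get(size, 0):
            self.free_blocks[size] -= 1  # reuse a cached block of this size
        else:
            self.reserved += size        # no matching block cached: grow

    def release(self, size):
        # Freed memory is cached, not returned to the device.
        self.free_blocks[size] = self.free_blocks.get(size, 0) + 1

    def empty_cache(self):
        # Analogue of torch.cuda.empty_cache(): hand cached blocks back.
        for size, count in self.free_blocks.items():
            self.reserved -= size * count
        self.free_blocks.clear()

rng = random.Random(0)
cached_only = ToyCachingAllocator()
flushed = ToyCachingAllocator()
for _ in range(1000):              # one iteration per "design"
    size = rng.randint(1, 10_000)  # variable-length contig -> varied size
    for allocator in (cached_only, flushed):
        allocator.alloc(size)
        allocator.release(size)
    flushed.empty_cache()          # empty_cache_per_design=True analogue

# cached_only.reserved has grown steadily; flushed.reserved is back to 0
```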
Fix
Add an optional `inference.empty_cache_per_design` flag (default `False`, opt-in) that calls `torch.cuda.empty_cache()` at the end of each design iteration. This releases all unused cached CUDA memory blocks back to the CUDA memory manager, keeping each worker near its initial VRAM footprint for the full run.

Changes
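A sketch of the two edits (the YAML key name is from this PR; the config object name `conf` and the exact surrounding code are assumptions about `run_inference.py`, not the actual diff):

```yaml
# config/inference/base.yaml — new key, defaulting to the old behavior
inference:
  empty_cache_per_design: False
```

```python
# scripts/run_inference.py — end of the per-design loop, after the
# trajectory/PDB writes (sketch only; surrounding code omitted)
if conf.inference.empty_cache_per_design:
    torch.cuda.empty_cache()  # return unused cached blocks to the device
```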
- `config/inference/base.yaml` — declare the new flag with its `False` default
- `scripts/run_inference.py` — after the trajectory/PDB write block, before the final `log.info`

Measured impact
Tested on NVIDIA RTX 5090 32 GB running a long PPI campaign with variable-length contigs:
With `empty_cache_per_design=True`, per-worker VRAM stayed near its initial footprint for the entire campaign instead of climbing to 10–13 GB. This allowed raising the number of parallel workers from 3 to 5 on a 32 GB GPU.
Why opt-in
`torch.cuda.empty_cache()` adds a small per-design overhead (~1–2 ms) and is only beneficial for long runs with variable-length contigs. For short runs or fixed-length designs there is no fragmentation issue, so the default remains `False` to preserve existing behavior.

Testing
All 20 applicable tests in `tests/test_diffusion.py` pass with this change. The one remaining test (`design_ppi_scaffolded`) is skipped because it fails due to a missing `ppi_scaffolds/` directory in the test fixture — a pre-existing issue unrelated to this PR.

Notes
The `empty_cache()` call sits after the final PDB write (`writepdb`) and the optional trajectory block — every consumer of `denoised_xyz_stack`/`px0_xyz_stack` has already finished before the cache is cleared.