
feat: add inference.empty_cache_per_design flag to reduce CUDA allocator fragmentation#451

Open
leonardomarino wants to merge 2 commits into RosettaCommons:main from leonardomarino:fix/empty-cache-per-design

Conversation


@leonardomarino leonardomarino commented Apr 21, 2026

Problem

When running RFdiffusion with variable-length contigs (e.g. contigmap.contigs=[A1-469/0 1-50]) over hundreds or thousands of designs, per-worker VRAM grows steadily from ~7 GB to 10–13 GB per process. This limits how many workers can run in parallel on a single GPU before exhausting VRAM.

Root cause: PyTorch's CUDA caching allocator accumulates fragmented memory blocks across designs. With variable-length contigs each design allocates differently-sized tensors; freed blocks are cached but cannot be reused for different-sized allocations, causing steady VRAM growth.
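This exact-size reuse behavior can be illustrated with a toy model (pure Python, a deliberate simplification — PyTorch's real allocator also does size rounding and block splitting): freed blocks are cached under their allocation size, so a stream of differently-sized requests keeps missing the cache and reserving new device memory.

```python
from collections import defaultdict

class ToyCachingAllocator:
    """Toy model of a size-keyed caching allocator (illustration only)."""

    def __init__(self):
        self.cache = defaultdict(list)  # size -> list of freed blocks
        self.total_reserved = 0         # bytes held from the "device"

    def alloc(self, size):
        if self.cache[size]:             # exact-size hit: reuse a cached block
            return self.cache[size].pop()
        self.total_reserved += size      # miss: grab fresh memory from the device
        return size

    def free(self, size):
        self.cache[size].append(size)    # cached for reuse, not returned to device

    def empty_cache(self):               # analogue of torch.cuda.empty_cache()
        for size, blocks in self.cache.items():
            self.total_reserved -= size * len(blocks)
        self.cache.clear()

fixed = ToyCachingAllocator()
for _ in range(100):
    fixed.free(fixed.alloc(1000))        # fixed-length designs: cache is reused

varied = ToyCachingAllocator()
for n in range(100):
    varied.free(varied.alloc(1000 + n))  # variable-length designs: every size misses

print(fixed.total_reserved, varied.total_reserved)  # → 1000 104950
```

The fixed-size run stays at one block's worth of reserved memory, while the variable-size run accumulates every allocation it ever made until `empty_cache()` is called.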

Fix

Add an optional inference.empty_cache_per_design flag (default False, opt-in) that calls torch.cuda.empty_cache() at the end of each design iteration. This releases all unused cached CUDA memory blocks back to the CUDA memory manager, keeping each worker near its initial VRAM footprint for the full run.

Changes

config/inference/base.yaml

  write_trajectory: True
  empty_cache_per_design: False   # NEW

scripts/run_inference.py — after the trajectory/PDB write block, before log.info:

        if conf.inference.empty_cache_per_design and torch.cuda.is_available():
            torch.cuda.empty_cache()

        log.info(f"Finished design in {(time.time()-start_time)/60:.2f} minutes")
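
To observe the effect on a live run, a small diagnostic helper along these lines (hypothetical, not part of this PR) could be called once per design; it degrades to a no-op when torch or CUDA is unavailable:

```python
def log_cuda_memory(log, tag=""):
    """Log reserved vs. allocated CUDA memory; no-op without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    reserved = torch.cuda.memory_reserved() / 2**30    # cached by the allocator
    allocated = torch.cuda.memory_allocated() / 2**30  # held by live tensors
    log.info(f"{tag} reserved={reserved:.2f} GiB allocated={allocated:.2f} GiB")
    return reserved - allocated  # headroom that empty_cache() could release
```

The gap between reserved and allocated is exactly the cached-but-unused memory the flag targets.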

Measured impact

Tested on NVIDIA RTX 5090 32 GB running a long PPI campaign with variable-length contigs:

| Setting | Per-worker VRAM (steady-state) |
| --- | --- |
| Without fix | 8–13 GB (grows over run) |
| With empty_cache_per_design=True | ~5.2 GB (stable) |

This allowed raising the number of parallel workers from 3 to 5 on a 32 GB GPU.
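Back-of-the-envelope, the worker count follows from the steady-state per-worker VRAM; the helper and the 2 GB reserve below are illustrative assumptions, with ~10 GB taken as a planning figure inside the observed 8–13 GB range:

```python
import math

def max_workers(total_gb, per_worker_gb, reserve_gb=2.0):
    """Workers that fit in VRAM while keeping reserve_gb free (hypothetical helper)."""
    return max(0, math.floor((total_gb - reserve_gb) / per_worker_gb))

print(max_workers(32, 10.0))  # without the flag, ~10 GB/worker → 3
print(max_workers(32, 5.2))   # with empty_cache_per_design=True → 5
```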

Why opt-in

torch.cuda.empty_cache() adds a small per-design overhead (~1–2 ms) and is only beneficial for long runs with variable-length contigs. For short runs or fixed-length designs there is no fragmentation issue, so the default remains False to preserve existing behavior.
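The overhead figure can be sanity-checked with a generic timing harness (a sketch; on a CUDA machine, pass torch.cuda.empty_cache as fn to reproduce the measurement):

```python
import time

def mean_call_time_ms(fn, repeats=200):
    """Average wall-clock time of fn() in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) * 1000 / repeats

# Baseline with a no-op; substitute torch.cuda.empty_cache on a GPU machine.
print(f"{mean_call_time_ms(lambda: None):.4f} ms")
```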

Testing

All 20 applicable tests in tests/test_diffusion.py pass with this change. The one skipped test (design_ppi_scaffolded) fails due to a missing ppi_scaffolds/ directory in the test fixture — a pre-existing issue unrelated to this PR.

Notes

  • Placement is after both the PDB write (writepdb) and the optional trajectory block — every consumer of denoised_xyz_stack / px0_xyz_stack has already finished before the cache is cleared.
  • This does not affect memory held by live tensors — only frees cached-but-unused blocks.
  • Compatible with all existing RFdiffusion design modes (PPI, motif scaffolding, unconditional).

@leonardomarino
Author

This failure is pre-existing and unrelated to this PR.

CI metadata:

  • Test: ubuntu-20.04.clang.python39.rfd — revision №75, daemon nobu-1
  • Started: 2026-04-21 11:09:01 · Run time: 0:06:06 · State: script failed

The ~6 minutes was entirely environment setup (virtualenv creation + package installation). The actual test crashed instantly at module import — before any
RFdiffusion inference code was reached:

import dgl
→ dgl/graphbolt/base.py
→ from torchdata.datapipes.iter import IterDataPipe
ModuleNotFoundError: No module named 'torchdata.datapipes'
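
The missing submodule can be confirmed without triggering the whole dgl import chain, for example with importlib (a diagnostic sketch, not part of the PR):

```python
import importlib.util

def has_module(name):
    """True if a module or submodule is importable, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:  # the parent package itself is missing
        return False

# On the broken CI image this prints False, matching the ModuleNotFoundError:
print(has_module("torchdata.datapipes"))
```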

torchdata.datapipes was removed in torchdata >= 0.7.0. The DGL version in the benchmark environment pulls in graphbolt, which requires it. This is a dependency pin
regression in the CI environment — my inference.empty_cache_per_design changes are not on this import path and have no bearing on the failure.

The CI report itself attributes the breakage to commit №20 by Junior Martins, not to this PR. The second traceback (shutil.move: FileNotFoundError: tests/outputs)
is a cascade — since the test never executed, the outputs/ directory was never created, causing the benchmark cleanup script to abort with script failed.

Recommended fix for CI: pin dgl < 2.0 or torchdata < 0.7.0 in the ubuntu-20.04.clang.python39.rfd test environment to restore the torchdata.datapipes API. This PR
should not be blocked on that environment issue.
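If the pin is adopted, a requirements-style constraint for that environment might look like this (a sketch; the exact bounds depend on which package the CI prefers to hold back):

```text
# constraints.txt for the ubuntu-20.04.clang.python39.rfd environment (sketch)
dgl<2.0
# or, alternatively, keep dgl and restore the old torchdata API:
# torchdata<0.7.0
```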

@leonardomarino
Author

@lyskov Could you please review this?

@lyskov lyskov requested a review from woodsh17 April 22, 2026 19:51
Member

@lyskov lyskov left a comment


Looks reasonable to me. Also tagging @woodsh17. Hope, thoughts on this? Thanks,

@leonardomarino
Author

The CI failures are pre-existing and unrelated to this PR.

Root cause: RuntimeError: Numpy is not available in rfdiffusion/igso3.py:93 during Diffuser.__init__(). NumPy's C extension (_ARRAY_API) fails to load in the CI environment (Python 3.9 venv); this is visible as a UserWarning: Failed to initialize NumPy: _ARRAY_API not found at the very start of the log, before any test logic runs.

The crash propagates up through IGSO3.__init__() → Diffuser.__init__() → SelfConditioning.initialize() → sampler_selector(), which is why line 54 of run_inference.py appears as the reported crash site: it is simply the outermost frame.

This PR's changes are post-loop cleanup at lines 191–201 of run_inference.py, which are only reached after a design completes successfully. They are unreachable from this failure. Running the same test suite against upstream main without these changes produces the identical error.
