gh-148653: refactored marshal for cycle safety and performance#148700

Open
mjbommar wants to merge 6 commits into python:main from mjbommar:marshal-safe-cycle-design

Conversation

@mjbommar
Contributor

@mjbommar mjbommar commented Apr 17, 2026

This is an experimental rewrite of marshal based on conversation with @serhiy-storchaka in #148652 and #148653.

Includes a number of extra docs and data generators that are provided only for reference during discussion.

So far it's green for me on the test suite, but I need to dig in further and assume @serhiy-storchaka will have better intuition than I do on any behavior or performance regressions.

In my first attempt, we hit minor single-digit performance regressions on the loads path, unsurprisingly concentrated in the complicated cases.

Second attempt with improved performance coming in an hour or so.

(edit: now faster than HEAD with real performance and correctness gains)

Assisted by GPT-5.4 xhigh and Opus 4.7

Replace the PyList-backed reference table with a raw growable
PyObject ** array, and encode REF_STATE_INCOMPLETE_HASHABLE in the low
bit of each ref pointer so the parallel state-byte allocation is gone.

Also:
- drop the allow_incomplete_hashable parameter from r_object; it lives
  on RFILE now, auto-reset on entry, flipped via a wrapper at the two
  list-element / dict-value sites.
- force-inline the r_ref_* helpers so the compiler can fold the
  if (flag) guards into the callers as the original R_REF macro did.

Misc/marshal-perf-diary.md records the full experiment ledger: each
idea tested in isolation, results, and the combined stack. Benchmark
harness is /tmp/marshal_bench_cpu_stable.py (200k loads x 11 repeats,
taskset -c 0, best-of-3 pinned-run median).

Combined deltas vs main on loads:

    small_tuple    14.3% faster
    nested_dict     6.9% faster
    code_obj        6.8% faster

dumps is roughly flat to slightly faster. test_marshal passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mjbommar mjbommar changed the title gh-148652: refactored marshal for cycle safety and performance gh-148653: refactored marshal for cycle safety and performance Apr 17, 2026
Appends to Misc/marshal-perf-diary.md the results of the full test
suite rerun (48,932 tests pass, including the new RecursiveGraphTest
combinatoric cases) and a `pyperformance` comparison against main on
the same 10-benchmark marshal-adjacent slice the design doc used.

Significant results on the pyperformance slice:

    python_startup          1.18x faster (t=59.80)
    python_startup_no_site  1.03x faster (t=12.90)

All other slice benchmarks within noise; no regressions.

Adds Misc/marshal-perf-data/ with the raw JSON backing every table in
the diary: all per-experiment microbench runs (exp0..exp9, expC, final)
plus the two pyperf-slice JSONs and a README describing the layout and
reproduction commands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mjbommar
Contributor Author

mjbommar commented Apr 17, 2026

@serhiy-storchaka @StanFromIreland - we're a long way from Kansas now, but I think the end result of this refactor is something much bigger: a 15-20% speedup in Python startup and a ~7-14% speedup in marshal.loads.

(edit: the startup speed looks like a fixed 1-3ms saving, so that % is for bare startup)

I'm going to try a few more complex tests, like running some of my real apps and pipelines through this build, to see if I can find regressions, but I wanted to share.

I have no experience with big CPython changes like this but hopefully I didn't do anything too crazy here from an ABI or compatibility perspective.

Update: Confirmed that the dill and cloudpickle test suites are unchanged (no new failures compared to 3.15 HEAD) and successfully ran a bunch of other day-to-day stuff without issue.

Records the outcome of an independent-library validation pass:

- dill 0.4.1 test suite (30 files) — identical 29/30 pass on baseline
  and HEAD; the single failure is a pre-existing 3.15a8 incompatibility
  in dill's module-state serialization, unrelated to marshal.
- cloudpickle 3.1.2 test suite (upstream) — 243/243 pass on both,
  identical skip/xfail breakdown.
- 1,601 marshal-adjacent stdlib tests (test_importlib, test_zipimport,
  test_compileall, test_py_compile, test_marshal) all pass on HEAD.
- compileall of CPython Lib/: +1.0% (within noise; dumps path untouched).
- Cold-import stress (56 stdlib modules, fresh subprocess): flat.
- Hypothesis fuzz (3500 random round-trips including cyclic shapes
  through mutable bridges): zero correctness regressions; acyclic
  round-trip -10%, list self-cycle -24%, dict value self-cycle -40%.

Nothing in the third-party validation hints at a correctness or
performance regression; several workloads that directly exercise the
changed code path are measurably faster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Wulian233
Contributor

Please remove your json and md files

@mjbommar
Contributor Author

Please remove your json and md files

Sorry about that. What's the right way to share empirical benchmark data and experimental notes if they get to be too long for GH comments? I think @serhiy-storchaka and others will want to see the full set of micro-benchmarks I ran across all experiments.

