This document describes the commit walking subsystem: how the scanner discovers which commits to process, how commit history is represented in memory, and what safety limits prevent runaway traversal.
flowchart TD
Tips["Start Set Refs\n(tip + optional watermark)"]
Tips --> LoadOrGraph{"Graph Source?"}
LoadOrGraph -- "File-based commit-graph\n(gix_commitgraph)" --> FileGraph["CommitGraph trait\n(gix adapter)"]
LoadOrGraph -- "No commit-graph file" --> BFS["CommitLoader BFS\nfrom tips"]
BFS --> MemGraph["CommitGraphMem\n(SoA + CSR)"]
FileGraph --> Index["CommitGraphIndex\n(flat OID/tree/ts arrays)"]
MemGraph --> Index
Index --> PlanIter["CommitPlanIter\ntwo-frontier range walk"]
PlanIter --> TopoSort["topo_order_positions\n(Kahn's algorithm)"]
TopoSort --> Plan["Vec<PlannedCommit>\n(ancestor-first order)"]
Plan --> TreeDiff["Tree Diff Stage"]
| Type | Kind | Description | Location |
|---|---|---|---|
CommitGraph |
trait | Commit-graph access interface: lookup, generation, parents, OIDs, timestamps. Implementations must provide deterministic parent iteration. | commit_walk.rs |
PlannedCommit |
struct | Output of the walk: a Position and snapshot_root flag indicating whether to diff against the empty tree. |
commit_walk.rs |
ParentScratch |
struct | Reusable scratch buffer for parent collection. Stores up to 16 parents inline (no allocation); spills to Vec for octopus merges. |
commit_walk.rs |
VisitedCommitBitset |
struct | 1-bit-per-commit bitset for cross-ref emission dedup. test_and_set returns whether the bit was newly marked. |
commit_walk.rs |
CommitPlanIter |
struct | Iterator over (watermark, tip] commits for all refs. Uses two generation-ordered heaps and per-ref scratch. Deduplicates across refs via VisitedCommitBitset. |
commit_walk.rs |
RefWatermark |
struct | Persisted watermark state: oid: OidBytes + generation: NonZeroU32. The generation enables O(1) force-push detection before the ancestry walk. |
watermark_keys.rs |
CommitWalkLimits |
struct | Hard caps for traversal: max graph size, heap entries, parents per commit, new-ref skip checks. 16 bytes, compile-time validated. | commit_walk_limits.rs |
| Type | Kind | Description | Location |
|---|---|---|---|
LoadedCommit |
struct | Commit with OID, tree OID, parent OIDs (in object order), and committer timestamp. Produced by BFS loading. | commit_loader.rs |
CommitLoadLimits |
struct | Bounds for the loader: max commits, max object bytes, max parents, max delta depth, shallow file/root limits. | commit_loader.rs |
CommitLoadError |
enum | Fail-closed error type covering I/O, MIDX, pack, parse, delta, loose-object, and limit-violation failures. | commit_loader.rs |
ShallowBoundaryRoots |
enum | Compact shallow-root set (Empty / One / ManySorted) used to stop BFS at shallow boundaries. Binary-search lookup for ManySorted. |
commit_loader.rs |
| Type | Kind | Description | Location |
|---|---|---|---|
CommitGraphMem |
struct | In-memory commit graph with SoA layout (OIDs, trees, timestamps, generations) and CSR parent encoding. Implements CommitGraph trait. |
commit_graph_mem.rs |
| Type | Kind | Description | Location |
|---|---|---|---|
CommitGraphIndex |
struct | Cache-friendly index: flat arrays of commit OIDs, root tree OIDs, and timestamps copied from the commit graph. Optional per-commit identity IDs. O(1) positional lookups. | commit_graph.rs |
| Type | Kind | Description | Location |
|---|---|---|---|
ParsedCommit |
struct | Parsed commit fields: tree OID, parent OIDs, committer timestamp. | commit_parse.rs |
CommitParseLimits |
struct | Max commit bytes (default 1 MiB) and max parents (default 256). | commit_parse.rs |
CommitParseError |
enum | Parse errors: corrupt data, too large, too many parents, invalid hex, invalid OID length, invalid timestamp. | commit_parse.rs |
| Type | Kind | Description | Location |
|---|---|---|---|
CommitWalkLimits |
struct | Four u32 fields (16 bytes total) with DEFAULT and RESTRICTIVE const presets. Both validate() (panicking) and try_validate() (fallible) entry points. Compile-time validated via const _: (). |
commit_walk_limits.rs |
| Variant | Description | Location |
|---|---|---|
CommitGraphOpen |
Commit-graph file could not be opened or parsed. | errors.rs |
InvalidOidLength |
OID byte length does not match object format. | errors.rs |
CommitGraphTooLarge |
Graph exceeds max_commits_in_graph. |
errors.rs |
TipNotFound |
Tip commit OID missing from commit-graph. | errors.rs |
HeapLimitExceeded |
Combined heap frontier size exceeds max_heap_entries. |
errors.rs |
ParentDecodeFailed |
Corrupt parent list entry in commit-graph. | errors.rs |
TooManyParents |
Commit exceeds max_parents_per_commit. |
errors.rs |
TopoSortCycle |
Kahn's algorithm could not drain all commits. | errors.rs |
IdentityLengthMismatch |
Identity-ID vector length does not match commit count. | errors.rs |
InvalidGeneration |
Commit-graph returned generation 0, which violates the spec (generation numbers start at 1). | errors.rs |
The range walk implements git rev-list <tip> ^<watermark> semantics:
it yields exactly the commits in reachable(tip) - reachable(watermark).
CommitPlanIter (commit_walk.rs) maintains two max-heaps ordered by
(generation, position):
- Interesting frontier -- seeded with the tip commit. Expanded by following parent edges.
- Uninteresting frontier -- seeded with the watermark commit (if valid ancestor of tip). Expanded similarly.
Each iteration:
- Peek the highest-generation interesting commit (
top_i). - Call
advance_uninteresting(top_i.gen)(commit_walk.rs) to drain the uninteresting heap down to that generation, marking all encountered commits asMARK_U. - Pop
top_i. If markedMARK_U, skip it. Otherwise, expand its parents into the interesting frontier. - Check the global
VisitedCommitBitsetto deduplicate across refs. Emit only if newly marked.
This produces a roughly reverse-topological order (highest generation first) within each ref. Across refs, output follows the deterministic start-set ordering.
RefScratch (commit_walk.rs) stores one byte per commit with three
bit flags:
| Flag | Value | Meaning |
|---|---|---|
SEEN_I |
0x01 |
Seen in the interesting (tip) walk. |
SEEN_U |
0x02 |
Seen in the uninteresting (watermark) walk. |
MARK_U |
0x04 |
Known reachable from watermark; excluded from output. |
Between refs, only the touched positions are cleared via the touched list
-- O(touched) instead of O(N).
When a ref has no watermark (new ref), the walker checks whether its tip is
an ancestor of another ref's watermark (commit_walk.rs). If so, the
entire history was already scanned in a prior run and the ref is skipped.
The number of ancestry checks is bounded by max_new_ref_skip_checks
(default 1024) to avoid O(refs^2) cost.
Before running the ancestry walk, resolve_current_watermark
(commit_walk.rs) performs an O(1) generation comparison when the
persisted RefWatermark carries a generation number:
- Look up the watermark OID in the commit graph.
- If the OID is missing (GC'd or rewritten), discard the watermark.
- If the stored
NonZeroU32generation differs from the graph's generation at that position, the commit was likely force-pushed away or the graph was rebuilt — discard the watermark without an ancestry walk. - If generations match, proceed to the full
is_ancestorDFS. Matching generations alone do not prove ancestry because unrelated siblings can share a generation number.
The same validation runs in the new-ref skip loop: a stale generation on another ref's watermark prevents spurious skip decisions.
Watermarks persisted without a generation trailer (from a prior code revision) are rejected by the decoder and treated as absent, triggering a one-time full rescan for that ref.
is_ancestor (commit_walk.rs) performs a DFS from descendant
looking for ancestor, using generation numbers for pruning. If
gen(ancestor) > gen(descendant), the answer is immediately false.
Otherwise, the DFS skips any commit with gen < gen(ancestor).
topo_order_positions (commit_walk.rs) applies Kahn's algorithm over
the scanned subgraph to produce ancestor-first ordering:
- Build in-degree and adjacency using dense arrays sized to the full commit-graph (avoids hash maps).
- Seed a
PosQueuering buffer with zero-in-degree commits. - Drain the queue, decrementing children's in-degree and enqueuing when zero.
- If
ordered.len() != total, returnTopoSortCycle.
The public entry point introduced_by_plan (commit_walk.rs) combines
CommitPlanIter with topo_order_positions to produce the final
Vec<PlannedCommit> in ancestor-first order.
The commit graph can be sourced from either:
- File-based commit-graph (
gix_commitgraph) -- when the repository has pre-built.git/objects/info/commit-graphfiles. Accessed through theCommitGraphtrait. - In-memory graph (
CommitGraphMem) -- built by loading commits via BFS when no file-based graph is available.
Both implement the CommitGraph trait, so the walker is agnostic to the
source.
CommitGraphMem (commit_graph_mem.rs) uses Struct-of-Arrays (SoA) for
cache-friendly sequential access:
| Array | Layout | Size per commit |
|---|---|---|
commit_oids |
Flat byte array, N * oid_len |
20 B (SHA-1) or 32 B (SHA-256) |
root_trees |
Flat byte array, N * oid_len |
20 B or 32 B |
timestamps |
Vec<u64> |
8 B |
generations |
Vec<u32> |
4 B |
parent_start |
Vec<u32>, N+1 entries (CSR prefix sums) |
4 B |
parents |
Vec<u32>, flattened parent positions |
variable |
identity_ids |
Optional Vec<CommitIdentityIds> |
16 B (when enabled) |
CSR parent encoding: parent positions for commit at position i live at
parents[parent_start[i]..parent_start[i+1]]. This avoids one Vec
allocation per commit.
Deterministic ordering: commits are sorted by (generation ASC, oid ASC)
before position assignment. Input order does not affect positions.
Computed via Kahn's algorithm during CommitGraphMem::build
(commit_graph_mem.rs):
gen(root) = 1gen(commit) = 1 + max(gen(parent))for in-set parents- Parents outside the loaded set are treated as generation 0 (external roots)
- Cycles: unresolved commits are force-assigned
gen = 1; the count is exposed viaunresolved_commits()(commit_graph_mem.rs)
CommitGraphIndex (commit_graph.rs) copies OIDs, tree OIDs, and
timestamps into flat arrays so the backing graph can be dropped. All reads
are O(1) positional lookups. Optionally carries per-commit identity IDs
when identity enrichment is enabled.
load_commits_from_tips (commit_loader.rs) is the canonical loading
entry point:
- Seed a
VecDequefrontier with tip OIDs. - Pop OID from frontier.
- Resolve via MIDX (pack lookup) or fall back to loose-object directories.
- Inflate and parse the commit object.
- Collect parents; enqueue those not yet visited or queued.
- Commits in the shallow boundary set are loaded but their parents are not enqueued (preserving local history while stopping at shallow cuts).
- Repeat until frontier empty or limits exceeded.
Determinism: BFS order is deterministic given stable input tip order, stable parent ordering in commit objects, and stable pack contents.
Frontier management: two sets (visited, queued) prevent duplicate
frontier entries in merge-heavy DAGs while preserving first-seen ordering.
Pack access: pack files are memory-mapped lazily on first access and
cached per pack id for the duration of the load. Delta chains are resolved
recursively, bounded by max_delta_depth (default 64).
Identity enrichment: load_commits_with_identities
(commit_loader.rs) performs the same BFS but also extracts
author/committer identity from raw commit bytes and interns them.
parse_commit (commit_parse.rs) extracts three fields from a raw
commit object:
- Tree OID -- from the
tree <hex>\nline. - Parent OIDs -- from zero or more
parent <hex>\nlines. - Committer timestamp -- extracted by scanning the
committerline backwards for the second-to-last space-delimited field, tolerating spaces in names/emails.
Parsing stops after the committer line; gpgsig headers and the commit
message are not parsed. Complexity is O(header size), not O(commit size).
Limits enforced during parsing:
| Limit | Default | Error |
|---|---|---|
| Max commit object bytes | 1 MiB | CommitParseError::TooLarge |
| Max parents | 256 | CommitParseError::TooManyParents |
| Timestamp range | 0 to year 3000 | CommitParseError::InvalidTimestamp |
The timestamp parser (commit_parse.rs) uses checked arithmetic to
detect u64 overflow, preventing wraparound attacks that could bypass range
checks.
| Field | Default | Restrictive | Purpose |
|---|---|---|---|
max_commits_in_graph |
10,000,000 | 200,000 | Bounds bitset and scratch array sizes |
max_heap_entries |
2,000,000 | 50,000 | Live frontier size cap (interesting + uninteresting) |
max_parents_per_commit |
128 | 32 | Guards against octopus/corrupt merges. Capped at 255 (pipeline stores parent_idx as u8). |
max_new_ref_skip_checks |
1,024 | 128 | Caps ancestry checks for new-ref skip optimization |
Validation constraints (commit_walk_limits.rs):
- All fields must be > 0
max_commits_in_graph <= 100,000,000max_heap_entries <= max_commits_in_graphmax_parents_per_commit <= 255max_new_ref_skip_checks <= 1,000,000DEFAULTandRESTRICTIVEare validated at compile time viaconst _: ()assertions (commit_walk_limits.rs)
| Field | Default | Purpose |
|---|---|---|
max_commits |
10,000,000 | Stop loading after this many commits |
max_commit_object_bytes |
1 MiB | Decompressed commit size cap |
max_parents |
256 | Per-commit parent count cap |
max_delta_depth |
64 | Maximum delta chain depth |
max_shallow_file_bytes |
8 MiB | Reject oversized .git/shallow files |
max_shallow_roots |
256 | Reject excessive unique shallow roots |
Cycles are detected at two levels:
- Graph construction (
CommitGraphMem::build): Kahn's algorithm exposes unresolved commits. These are force-assigned generation 1 and counted inunresolved_commits. This is a soft recovery -- the graph remains usable. - Topological ordering (
topo_order_positions): if Kahn's algorithm fails to drain all commits in the scanned subgraph, it returnsCommitPlanError::TopoSortCycle. This is a hard error.
| Structure | Cost per commit | Notes |
|---|---|---|
VisitedCommitBitset |
1 bit | Cross-ref dedup; allocated once for the full graph |
RefScratch |
1 byte | Per-ref flags; cleared via touched list between refs |
| Heap frontiers | Variable | Bounded by max_heap_entries |
ParentScratch |
0 (inline <=16) | Heap spill only for >16 parents |
CommitGraphMem |
~56-72 bytes | SoA arrays + CSR edges |
CommitGraphIndex |
~48-68 bytes | Flat OID/tree/timestamp copies |
| Concept | Source File |
|---|---|
CommitGraph trait |
commit_walk.rs |
| Two-frontier range walk | commit_walk.rs |
advance_uninteresting |
commit_walk.rs |
is_ancestor (DFS + generation pruning) |
commit_walk.rs |
introduced_by_plan |
commit_walk.rs |
topo_order_positions (Kahn's) |
commit_walk.rs |
PlannedCommit |
commit_walk.rs |
ParentScratch |
commit_walk.rs |
VisitedCommitBitset |
commit_walk.rs |
CommitPlanIter |
commit_walk.rs |
RefScratch flags |
commit_walk.rs |
| BFS commit loading | commit_loader.rs |
CommitLoader (internal) |
commit_loader.rs |
LoadedCommit |
commit_loader.rs |
CommitLoadLimits |
commit_loader.rs |
ShallowBoundaryRoots |
commit_loader.rs |
| Shallow file loading | commit_loader.rs |
| Loose object loading | commit_loader.rs |
| Delta resolution (recursive) | commit_loader.rs |
| Identity enrichment | commit_loader.rs |
CommitGraphMem (SoA + CSR) |
commit_graph_mem.rs |
CommitGraphMem::build (generation propagation) |
commit_graph_mem.rs |
CommitGraphMem::build_with_identities |
commit_graph_mem.rs |
CommitGraph impl for CommitGraphMem |
commit_graph_mem.rs |
CommitGraphIndex |
commit_graph.rs |
parse_commit |
commit_parse.rs |
CommitParseLimits |
commit_parse.rs |
CommitParseError |
commit_parse.rs |
| Timestamp parsing (checked arithmetic) | commit_parse.rs |
CommitWalkLimits |
commit_walk_limits.rs |
CommitWalkLimits::DEFAULT |
commit_walk_limits.rs |
CommitWalkLimits::RESTRICTIVE |
commit_walk_limits.rs |
CommitWalkLimits::validate |
commit_walk_limits.rs |
CommitWalkLimits::try_validate |
commit_walk_limits.rs |
| Compile-time validation | commit_walk_limits.rs |
CommitPlanError |
errors.rs |
docs/scanner-git/git-scanning.md-- end-to-end pipeline overviewdocs/scanner-git/git-pack-execution.md-- pack decode stage