Archive Scanning Subsystem

The archive scanning subsystem provides streaming archive parsing with deterministic budget enforcement for zip-bomb protection. It handles tar, gzip, bzip2, and zip formats, enforces nesting depth limits, controls decompressed output budgets at entry/archive/root scopes, and constructs virtual paths for files inside archives.

Module Purpose

crates/scanner-scheduler/src/archive/ defines the archive scanning contract: configuration, format detection, budget tracking, virtual path construction, outcome taxonomy, and the streaming scan loop. Format-specific handlers live in archive/formats/ (low-level parsers) and scheduler/local_fs_*.rs (blocking-path integration). The io_uring path delegates to archive/scan.rs via the ArchiveEntrySink trait.

The architecture emphasizes:

Streaming only: archives are parsed sequentially without materializing to disk
Deterministic budgets: every resource cap is enforced by byte and count accounting, so results are reproducible; the only clock-dependent cap, the wall-clock deadline, is opt-in
Zero allocation after startup: all buffers are preallocated and reused via reset() / clear()
Sink-driven decoupling: the scan core delivers structured events via ArchiveEntrySink, allowing different consumers (pipeline, io_uring workers, simulation harness)

Supported Formats

Format	Kind	Extension(s)	Access Pattern	Entry Model
gzip	`Gzip`	`.gz`	Sequential `Read`	Single decompressed blob
bzip2	`Bzip2`	`.bz2`	Sequential `Read`	Single decompressed blob
tar	`Tar`	`.tar`	Sequential 512-byte blocks	Multiple named entries
tar+gzip	`TarGz`	`.tar.gz`, `.tgz`	Sequential gzip → tar	Multiple named entries
tar+bzip2	`TarBz2`	`.tar.bz2`, `.tbz2`	Sequential bzip2 → tar	Multiple named entries
zip	`Zip`	`.zip`	Random access via EOCD	Multiple named entries

Format Detection

Detection uses a two-phase algorithm (archive/detect.rs):

Extension-based (detect_kind_from_path / detect_kind_from_name_bytes): a pure byte-suffix match that folds ASCII case by OR-ing each byte with 0x20. Single-byte dispatch on the last character rejects most non-archive names in O(1). Extension detection always takes precedence, because it is the only way to distinguish .tar.gz from plain .gz.
Magic-byte sniffing (sniff_kind_from_header): probes the first bytes when the extension is unrecognized. Probe order: gzip (1f 8b) → zip (PK..) → bzip2 (BZh + digit 1–9) → tar (ustar at offset 257 in a 512-byte header). First match wins.

The combined function detect_kind tries extension first, then falls back to magic bytes.
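The extension phase can be sketched as follows. This is a minimal illustration, not the code in archive/detect.rs: the `Kind` enum and the `kind_from_name` and `ends_with_ci` helpers are hypothetical names, and the real matcher may dispatch differently.

```rust
/// Hypothetical stand-in for the document's `ArchiveKind`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Gzip,
    Bzip2,
    Tar,
    TarGz,
    TarBz2,
    Zip,
}

/// Case-insensitive ASCII suffix test. OR-ing with 0x20 lowercases A-Z and
/// leaves '.', digits, and lowercase letters unchanged. A few control bytes
/// alias onto those characters; a stricter matcher would exclude them.
fn ends_with_ci(name: &[u8], suffix: &[u8]) -> bool {
    name.len() >= suffix.len()
        && name[name.len() - suffix.len()..]
            .iter()
            .zip(suffix)
            .all(|(a, b)| (*a | 0x20) == *b)
}

pub fn kind_from_name(name: &[u8]) -> Option<Kind> {
    // Dispatch on the last byte so most names are rejected after one compare.
    match *name.last()? | 0x20 {
        b'z' => {
            // Longest suffix first, so `.tar.gz` is not mistaken for `.gz`.
            if ends_with_ci(name, b".tar.gz") || ends_with_ci(name, b".tgz") {
                Some(Kind::TarGz)
            } else if ends_with_ci(name, b".gz") {
                Some(Kind::Gzip)
            } else {
                None
            }
        }
        b'2' => {
            if ends_with_ci(name, b".tar.bz2") || ends_with_ci(name, b".tbz2") {
                Some(Kind::TarBz2)
            } else if ends_with_ci(name, b".bz2") {
                Some(Kind::Bzip2)
            } else {
                None
            }
        }
        b'r' if ends_with_ci(name, b".tar") => Some(Kind::Tar),
        b'p' if ends_with_ci(name, b".zip") => Some(Kind::Zip),
        _ => None,
    }
}
```

Checking the compound suffixes before the simple ones is what lets the extension phase tell .tar.gz apart from .gz, which magic bytes alone cannot do.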

Magic Byte Signatures

Format	Minimum bytes	Signature
gzip	2	`0x1f 0x8b`
zip	4	`PK` + `(03,04)`, `(01,02)`, `(05,06)`, or `(07,08)`
bzip2	4	`BZh` + ASCII digit `'1'`–`'9'`
tar (ustar)	512	`"ustar"` at offset 257
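
The signature table maps directly onto a probe function. The sketch below follows the documented probe order (gzip, zip, bzip2, tar); the `Sniffed` enum and `sniff` function are illustrative names, not the signature of sniff_kind_from_header.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sniffed {
    Gzip,
    Zip,
    Bzip2,
    Tar,
}

/// Probes the leading bytes of a file. First match wins.
pub fn sniff(header: &[u8]) -> Option<Sniffed> {
    // gzip: two magic bytes.
    if header.starts_with(&[0x1f, 0x8b]) {
        return Some(Sniffed::Gzip);
    }
    // zip: "PK" followed by one of four record-type byte pairs.
    if header.len() >= 4
        && header.starts_with(b"PK")
        && matches!((header[2], header[3]), (3, 4) | (1, 2) | (5, 6) | (7, 8))
    {
        return Some(Sniffed::Zip);
    }
    // bzip2: "BZh" followed by the block-size digit '1'..='9'.
    if header.len() >= 4 && header.starts_with(b"BZh") && (b'1'..=b'9').contains(&header[3]) {
        return Some(Sniffed::Bzip2);
    }
    // tar: "ustar" at offset 257 of a full 512-byte header block.
    if header.len() >= 512 && &header[257..262] == b"ustar" {
        return Some(Sniffed::Tar);
    }
    None
}
```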

Architecture

Integration with the Local FS Scanner

Archive scanning integrates at two levels:

Blocking path (scheduler/local_fs_owner.rs): the worker thread handles archives inline via dispatch_archive_scan() (scheduler/local_fs_archive_ctx.rs), which routes to format-specific handlers (local_fs_gzip.rs, local_fs_bzip2.rs, local_fs_tar.rs, local_fs_zip.rs). Blocking workers process one file at a time, so decompression blocking is acceptable.

io_uring path (scheduler/local_fs_uring.rs): archives are offloaded to dedicated archive worker threads to avoid stalling the io_uring completion loop. Detection happens at two points:

Discovery time: files with known archive extensions are routed directly to the archive channel
First-chunk classification: I/O threads sniff magic bytes after the first read completes

Both paths use the same archive subsystem types (ArchiveConfig, ArchiveBudgets, ArchiveScratch) and produce the same outcomes (ArchiveEnd).

Sink-Driven Entry Interface

The archive scan core (archive/scan.rs) drives the ArchiveEntrySink trait, defined in the same file, which decouples parsing from downstream processing:

on_entry_start(&meta)        // exactly once per entry
  on_entry_chunk(chunk)      // zero or more payload windows
  on_entry_chunk(chunk)
  ...
on_entry_end()               // exactly once, even on truncation

The start/end pair is always balanced. The io_uring path implements ArchiveEntrySink via UringArchiveSink, which calls scan_chunk_into → drop_prefix_findings → dedupe → emit. The blocking path uses ArchiveScanCtx::scan_and_emit_chunk (local_fs_archive_ctx.rs).
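
To make the callback contract concrete, here is a simplified sketch of a sink and a toy driver. The method names come from the protocol above; the struct fields, parameter types, and the `EntrySink`, `CountingSink`, and `drive` names are assumptions made for illustration and are not the crate's actual definitions.

```rust
/// Simplified, assumed shapes for the metadata and chunk types.
pub struct EntryMeta<'a> {
    pub display_path: &'a [u8],
    pub size_hint: u64,
}

pub struct EntryChunk<'a> {
    pub data: &'a [u8],
    pub new_bytes_start: usize,
}

pub trait EntrySink {
    fn on_entry_start(&mut self, meta: &EntryMeta<'_>);
    fn on_entry_chunk(&mut self, chunk: EntryChunk<'_>);
    fn on_entry_end(&mut self);
}

/// A sink that only counts, useful for checking start/end balance.
#[derive(Default)]
pub struct CountingSink {
    pub open: i64,
    pub entries: u64,
    pub new_bytes: u64,
    pub last_path_len: usize,
    pub last_size_hint: u64,
}

impl EntrySink for CountingSink {
    fn on_entry_start(&mut self, meta: &EntryMeta<'_>) {
        self.open += 1;
        self.entries += 1;
        self.last_path_len = meta.display_path.len();
        self.last_size_hint = meta.size_hint;
    }
    fn on_entry_chunk(&mut self, chunk: EntryChunk<'_>) {
        // Only bytes after the overlap prefix are new payload.
        self.new_bytes += (chunk.data.len() - chunk.new_bytes_start) as u64;
    }
    fn on_entry_end(&mut self) {
        self.open -= 1;
    }
}

/// Toy driver standing in for the scan core: one start, N chunks, one end.
pub fn drive<S: EntrySink>(sink: &mut S, path: &[u8], payload: &[u8], chunk: usize) {
    sink.on_entry_start(&EntryMeta { display_path: path, size_hint: payload.len() as u64 });
    for piece in payload.chunks(chunk.max(1)) {
        sink.on_entry_chunk(EntryChunk { data: piece, new_bytes_start: 0 });
    }
    sink.on_entry_end();
}
```

Because the driver owns the start/end calls, a consumer such as a pipeline stage or a simulation harness only has to implement the three callbacks.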

Budget Enforcement

Budget tracking (archive/budget.rs) prevents resource exhaustion from zip bombs, deeply nested archives, and adversarial metadata. All accounting is deterministic and reproducible.

Budget Hierarchy

Three nested scopes, each with independent caps:

Root (per-source-file)
 └─ Archive (per-container: zip, tar, tar.gz, …)
     └─ Entry (per-file inside the container)

When charging decompressed output, the tightest remaining allowance across all three scopes wins. The binding constraint determines the BudgetHit variant, which tells callers whether to skip the entry, mark the archive partial, or stop the entire root.

Budget Caps

Cap	Config Field	Default	Scope
Nesting depth	`max_archive_depth`	3	Per root
Entry count	`max_entries_per_archive`	4096	Per archive
Entry output bytes	`max_uncompressed_bytes_per_entry`	64 MiB	Per entry
Archive output bytes	`max_total_uncompressed_bytes_per_archive`	256 MiB	Per archive
Root output bytes	`max_total_uncompressed_bytes_per_root`	512 MiB	Per root
Metadata bytes	`max_archive_metadata_bytes`	16 MiB	Per archive
Inflation ratio	`max_inflation_ratio`	128x	Per entry + per archive
Wall-clock deadline	`max_wall_clock_secs_per_root`	`None` (opt-in)	Per root

The nesting invariant on the output byte caps (per-entry cap <= per-archive cap <= per-root cap) is enforced by ArchiveConfig::validate() (config.rs).
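
A validation routine for that invariant might look like the sketch below. The field names and defaults are taken from the table above; the `ByteCaps` struct, the error type, and the choice to check the wall-clock bound inside `validate` are assumptions, not the actual ArchiveConfig code.

```rust
const MIB: u64 = 1024 * 1024;
pub const MAX_WALL_CLOCK_SECS_PER_ROOT: u64 = 86_400;

/// Hypothetical subset of the configuration, limited to the nested byte caps
/// and the optional deadline.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ByteCaps {
    pub max_uncompressed_bytes_per_entry: u64,
    pub max_total_uncompressed_bytes_per_archive: u64,
    pub max_total_uncompressed_bytes_per_root: u64,
    pub max_wall_clock_secs_per_root: Option<u64>,
}

impl Default for ByteCaps {
    fn default() -> Self {
        Self {
            max_uncompressed_bytes_per_entry: 64 * MIB,
            max_total_uncompressed_bytes_per_archive: 256 * MIB,
            max_total_uncompressed_bytes_per_root: 512 * MIB,
            max_wall_clock_secs_per_root: None,
        }
    }
}

impl ByteCaps {
    pub fn validate(&self) -> Result<(), &'static str> {
        let entry = self.max_uncompressed_bytes_per_entry;
        let archive = self.max_total_uncompressed_bytes_per_archive;
        let root = self.max_total_uncompressed_bytes_per_root;
        if !(entry <= archive && archive <= root) {
            return Err("byte caps must satisfy entry <= archive <= root");
        }
        // Assumption: the deadline upper bound is checked here as well.
        if matches!(self.max_wall_clock_secs_per_root, Some(s) if s > MAX_WALL_CLOCK_SECS_PER_ROOT) {
            return Err("wall-clock deadline exceeds MAX_WALL_CLOCK_SECS_PER_ROOT");
        }
        Ok(())
    }
}
```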

Inflation Ratio Enforcement

Ratio tracking runs at both archive and entry scopes to prevent a credit-accumulation attack. Per-entry ratio is tracked independently: entry_out <= entry_in * R. This prevents a pattern where many small well-compressed entries build up archive-level headroom that a single malicious entry later exploits (budget.rs).

When the compressed-input count is zero (not yet known), the ratio check is skipped to avoid false positives. To keep the first read bounded anyway, remaining_decompressed_allowance_with_ratio_probe(true) assumes a conservative 1 byte of compressed input when computing the allowance.
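
The credit-accumulation scenario is easier to see with numbers. The sketch below tracks the two ratio scopes side by side; `RatioState`, `headroom`, and `allowed_out` are hypothetical names, and the treatment of the probe flag is a simplified reading of the paragraph above rather than the logic in budget.rs.

```rust
pub struct RatioState {
    ratio: u64,
    archive_in: u64,
    archive_out: u64,
    entry_in: u64,
    entry_out: u64,
}

/// Remaining output allowed under `out <= in * ratio`.
fn headroom(compressed_in: u64, decompressed_out: u64, ratio: u64) -> u64 {
    compressed_in.saturating_mul(ratio).saturating_sub(decompressed_out)
}

impl RatioState {
    pub fn new(ratio: u64) -> Self {
        Self { ratio, archive_in: 0, archive_out: 0, entry_in: 0, entry_out: 0 }
    }
    pub fn begin_entry(&mut self) {
        self.entry_in = 0;
        self.entry_out = 0;
    }
    pub fn charge_compressed_in(&mut self, n: u64) {
        self.archive_in += n;
        self.entry_in += n;
    }
    pub fn charge_decompressed_out(&mut self, n: u64) {
        self.archive_out += n;
        self.entry_out += n;
    }
    /// How many of `want` output bytes fit under both ratio caps.
    pub fn allowed_out(&self, want: u64, probe: bool) -> u64 {
        let (entry_in, archive_in) = match (self.entry_in, probe) {
            // Compressed input unknown and no probe requested: skip the check.
            (0, false) => return want,
            // Probe: assume one compressed byte so the first read is capped.
            (0, true) => (1, self.archive_in.max(1)),
            (n, _) => (n, self.archive_in),
        };
        want.min(headroom(entry_in, self.entry_out, self.ratio))
            .min(headroom(archive_in, self.archive_out, self.ratio))
    }
}
```

With a ratio of 128, an entry that consumed 1000 compressed bytes and produced 1000 output bytes leaves roughly 127 KB of archive-level headroom. A following entry with only 10 compressed bytes is still held to 1280 output bytes by its own scope.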

Budget Lifecycle Protocol

reset()                        // arm deadline, zero root counters
  enter_archive()              // push frame, enforce depth cap
    note_entry() / begin_entry() // count + open entry scope
      charge_compressed_in(n)   // raw bytes consumed
      charge_decompressed_out(n) // payload bytes delivered
      charge_discarded_out(n)    // payload bytes read but dropped
    end_entry(scanned)          // close entry scope
  exit_archive()               // pop frame

enter_archive/exit_archive and begin_entry/end_entry must be balanced. The frame stack is preallocated to max_archive_depth and never grows — no Vec push/pop on hot paths.
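
A fixed-capacity frame stack of this kind can be sketched with a boxed slice and a depth index. The method names mirror the lifecycle above, but the `Budgets` and `Frame` structs, their fields, and the return types are illustrative assumptions.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum EnterError {
    DepthExceeded,
}

#[derive(Clone, Copy, Default)]
struct Frame {
    entries_seen: u32,
}

pub struct Budgets {
    frames: Box<[Frame]>, // allocated once, never resized
    depth: usize,
    max_entries: u32,
}

impl Budgets {
    pub fn new(max_archive_depth: u8, max_entries: u32) -> Self {
        let frames = vec![Frame::default(); max_archive_depth as usize].into_boxed_slice();
        Self { frames, depth: 0, max_entries }
    }

    /// Start a new root: drop back to depth zero.
    pub fn reset(&mut self) {
        self.depth = 0;
    }

    /// Push a frame by bumping the index; fails once the depth cap is reached.
    pub fn enter_archive(&mut self) -> Result<(), EnterError> {
        if self.depth == self.frames.len() {
            return Err(EnterError::DepthExceeded);
        }
        self.frames[self.depth] = Frame::default();
        self.depth += 1;
        Ok(())
    }

    /// Pop a frame; must pair with a successful `enter_archive`.
    pub fn exit_archive(&mut self) {
        debug_assert!(self.depth > 0, "unbalanced exit_archive");
        self.depth = self.depth.saturating_sub(1);
    }

    /// Counts an entry in the current archive; false once the count cap is exceeded.
    pub fn note_entry(&mut self) -> bool {
        let frame = &mut self.frames[self.depth - 1];
        frame.entries_seen += 1;
        frame.entries_seen <= self.max_entries
    }

    pub fn depth(&self) -> usize {
        self.depth
    }
}
```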

Budget Hit Classification

BudgetHit variants are ordered by increasing blast radius (budget.rs):

Variant	Scope	Effect
`SkipEntry`	Current entry only	Archive continues with next entry
`SkipArchive`	Entire archive	Discarded (no scan progress yet)
`PartialArchive`	Entire archive	Stops; bytes already scanned are kept
`StopRoot`	All archives under this root	Everything stops

ChargeResult::Clamp { allowed, hit } tells the caller the exact number of bytes it may still process before the limit takes effect.
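
The interaction between the three scopes and the clamp result can be sketched as below. The mapping from binding scope to variant (entry to SkipEntry, archive to PartialArchive, root to StopRoot) and the tie-breaking toward the wider scope are assumptions for illustration; `Remaining` is a hypothetical type, and the ratio caps are left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BudgetHit {
    SkipEntry,
    PartialArchive,
    StopRoot,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ChargeResult {
    Ok,
    Clamp { allowed: u64, hit: BudgetHit },
}

/// Remaining decompressed-output allowance in each scope.
pub struct Remaining {
    pub entry: u64,
    pub archive: u64,
    pub root: u64,
}

impl Remaining {
    pub fn charge_decompressed_out(&mut self, n: u64) -> ChargeResult {
        // The tightest scope is the binding constraint. On ties the wider
        // scope is reported (an assumption of this sketch).
        let mut allowed = self.entry;
        let mut hit = BudgetHit::SkipEntry;
        if self.archive <= allowed {
            allowed = self.archive;
            hit = BudgetHit::PartialArchive;
        }
        if self.root <= allowed {
            allowed = self.root;
            hit = BudgetHit::StopRoot;
        }
        // `take` never exceeds any scope's remainder, so no underflow.
        let take = n.min(allowed);
        self.entry -= take;
        self.archive -= take;
        self.root -= take;
        if take == n {
            ChargeResult::Ok
        } else {
            ChargeResult::Clamp { allowed: take, hit }
        }
    }
}
```

A caller would process exactly `allowed` bytes of the chunk it just read and then act on `hit`: move to the next entry, finish the archive as partial, or stop the root.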

Wall-Clock Deadline

The wall-clock deadline is opt-in (max_wall_clock_secs_per_root). When configured:

reset() arms an Instant-based deadline (the only place Instant::now() is called)
is_deadline_expired() is polled at natural loop boundaries
The deadline does not affect byte or count accounting
Maximum allowed value is 86,400 seconds (24 hours), enforced by MAX_WALL_CLOCK_SECS_PER_ROOT

In test/sim-harness builds, a deterministic countdown (set_deadline_check_countdown) replaces the real clock.
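
One way to structure such a deadline is sketched below. The `Deadline` type and its method names are hypothetical; the sketch reads the clock directly only in `reset` and uses `Instant::elapsed` for polling, and it models the test countdown as a plain field rather than a build-time switch.

```rust
use std::time::{Duration, Instant};

pub struct Deadline {
    armed: Option<(Instant, Duration)>,
    countdown: Option<u32>,
}

impl Deadline {
    /// Arms the deadline when a limit is configured; `None` leaves it disabled.
    pub fn reset(max_secs: Option<u64>) -> Self {
        Self {
            armed: max_secs.map(|s| (Instant::now(), Duration::from_secs(s))),
            countdown: None,
        }
    }

    /// Test hook (assumed shape): expire after `n` further polls, ignoring the clock.
    pub fn set_check_countdown(&mut self, n: u32) {
        self.countdown = Some(n);
    }

    /// Polled at loop boundaries; never touches byte or count accounting.
    pub fn is_expired(&mut self) -> bool {
        if let Some(n) = self.countdown.as_mut() {
            if *n == 0 {
                return true;
            }
            *n -= 1;
            return false;
        }
        self.armed.map_or(false, |(start, limit)| start.elapsed() >= limit)
    }
}
```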

Key Types

Configuration

ArchiveConfig (config.rs) — shared archive scanning configuration. All limits are hard bounds. Archives are treated as hostile input.

Field	Type	Purpose
`enabled`	`bool`	Master enable switch
`max_archive_depth`	`u8`	Max nested archive depth
`max_entries_per_archive`	`u32`	Max entries per container
`max_uncompressed_bytes_per_entry`	`u64`	Per-entry decompressed byte cap
`max_total_uncompressed_bytes_per_archive`	`u64`	Per-archive decompressed byte cap
`max_total_uncompressed_bytes_per_root`	`u64`	Root-level cross-archive byte cap
`max_archive_metadata_bytes`	`u64`	Metadata parsing cap (headers, CD)
`max_inflation_ratio`	`u32`	Decompressed/compressed ratio cap
`max_virtual_path_len_per_entry`	`usize`	Max display path bytes per entry
`max_virtual_path_bytes_per_archive`	`usize`	Total path bytes per archive
`max_wall_clock_secs_per_root`	`Option<u64>`	Optional CPU-exhaustion deadline
`encrypted_policy`	`EncryptedPolicy`	How to handle encrypted content
`unsupported_policy`	`UnsupportedPolicy`	How to handle unsupported formats

EncryptedPolicy / UnsupportedPolicy (config.rs) — escalation ladders from SkipWithTelemetry → FailArchive → FailRun.

Budget Tracking

ArchiveBudgets (budget.rs) — deterministic budget tracker. Holds immutable caps from config and mutable counters. A fixed-size frame stack tracks per-archive state without allocation.

ArchiveFrame (budget.rs) — per-archive accounting frame (48 bytes, #[repr(C)]). Tracks entries_seen, entries_scanned, metadata_bytes, compressed_in, decompressed_out, entry_compressed_in, and entry_decompressed_out. Entry-open state uses a u64::MAX sentinel instead of a separate bool to avoid 7 bytes of padding.

BudgetHit (budget.rs) — classification of which budget was the binding constraint.

ChargeResult (budget.rs) — result of charging a byte quantity: Ok (full amount fits) or Clamp { allowed, hit } (partial).

Format Detection

ArchiveKind (detect.rs) — #[repr(u8)] enum: Gzip(0), Tar(1), Zip(2), TarGz(3), Bzip2(4), TarBz2(5). The is_container() method distinguishes multi-entry formats from single-stream formats.

Scan Core

ArchiveEntrySink (scan.rs) — trait decoupling archive parsing from downstream. Methods: on_entry_start, on_entry_chunk, on_entry_end.

EntryMeta (scan.rs) — metadata for a single entry: display_path, size_hint, flags.

EntryChunk (scan.rs) — one iteration of the sliding-window read loop: data (overlap prefix + new bytes), base_offset, new_bytes_start, new_bytes_len.

ArchiveScratch<Z> (scan.rs) — reusable scratch state. Contains EntryPathCanonicalizer, per-depth VirtualPathBuilders, ArchiveBudgets, per-depth TarCursors, ZipCursor, gzip header/name buffers, and the stream_buf. Preallocated to max_archive_depth + 2 depth slots.

ArchiveScanCtx (scan.rs, crate-private) — borrow-split view that decomposes ArchiveScratch into independent mutable borrows for recursive nesting via split_first_mut. Not part of the public API.

ArchiveEnd (scan.rs) — terminal outcome: Scanned, Skipped(ArchiveSkipReason), Partial(PartialReason).

Outcome Taxonomy

ArchiveSkipReason (outcome.rs) — 14 variants for why an entire archive was skipped before any payload bytes were scanned. #[repr(u8)] with stable discriminants used as array indices.

EntrySkipReason (outcome.rs) — 10 variants for why a specific entry was skipped. Entry skips do not abort the archive.

PartialReason (outcome.rs) — 12 variants for why an archive was only partially scanned. Partial outcomes retain results for bytes already processed.

ArchiveStats (outcome.rs) — per-worker aggregate with scalar counters, per-reason breakdown arrays, and a bounded sample ring (ArchiveSampleRing). All record_* methods are gated behind cfg!(all(feature = "perf-stats", debug_assertions)) for zero production overhead.
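
Stable `#[repr(u8)]` discriminants make per-reason counters a plain array lookup. The sketch below shows the idea with a three-variant enum; the variant list, the `COUNT` constant, and the `Stats` struct are illustrative, and the feature gating described above is omitted.

```rust
/// Illustrative subset; the real enum has 14 variants.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SkipReason {
    IoError = 0,
    UnsupportedFeature = 1,
    DepthExceeded = 2,
}

impl SkipReason {
    /// Number of variants; sizes the per-reason array.
    pub const COUNT: usize = 3;
}

#[derive(Default)]
pub struct Stats {
    pub archives_skipped: u64,
    pub skipped_by_reason: [u64; SkipReason::COUNT],
}

impl Stats {
    pub fn record_archive_skipped(&mut self, reason: SkipReason) {
        self.archives_skipped += 1;
        // The discriminant doubles as the array index, so no map is needed.
        self.skipped_by_reason[reason as usize] += 1;
    }
}
```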

Virtual Paths

EntryPathCanonicalizer (path.rs) — sanitizes raw archive entry names into bounded, printable-ASCII display bytes. Resolves ./.., escapes non-printable bytes as %HH, enforces length and component caps.

VirtualPathBuilder (path.rs) — joins parent and entry display paths with :: separator. Truncation appends ~#<16-hex-digit> FNV-1a hash suffix.

CanonicalPath (path.rs) — result of canonicalization: bytes, had_traversal, truncated, component_cap_exceeded, hash64.

VirtualPath (path.rs) — result of virtual path construction: bytes, truncated, hash64.
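
The join-and-truncate behavior can be sketched as follows. The FNV-1a constants are the standard 64-bit ones; `build_virtual_path` is a hypothetical helper that allocates for brevity (the real builder reuses preallocated buffers) and assumes `max_len` is at least the 18-byte suffix length.

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;
const TRUNC_SUFFIX_LEN: usize = 18; // "~#" + 16 hex digits

/// 64-bit FNV-1a: XOR each byte in, then multiply by the prime.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .fold(FNV_OFFSET, |h, &b| (h ^ b as u64).wrapping_mul(FNV_PRIME))
}

/// Joins parent and entry with "::". If the result exceeds `max_len`, keeps a
/// prefix and appends "~#" plus the hash of the full, untruncated path.
/// Returns the bytes and whether truncation happened.
pub fn build_virtual_path(parent: &[u8], entry: &[u8], max_len: usize) -> (Vec<u8>, bool) {
    let mut full = Vec::with_capacity(parent.len() + 2 + entry.len());
    full.extend_from_slice(parent);
    full.extend_from_slice(b"::");
    full.extend_from_slice(entry);
    if full.len() <= max_len {
        return (full, false);
    }
    let hash = fnv1a64(&full);
    full.truncate(max_len.saturating_sub(TRUNC_SUFFIX_LEN));
    full.extend_from_slice(format!("~#{hash:016x}").as_bytes());
    (full, true)
}
```

Hashing the full path before truncation means two long paths that share a prefix still get distinct display strings, apart from hash collisions.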

Format-Specific Types

GzipStream<R> (formats/gzip.rs) — streaming gzip decoder wrapping flate2::MultiGzDecoder<CountedRead<R>>. Handles concatenated members. Reports compressed-byte deltas via take_compressed_delta().

Bzip2Stream<R> (formats/bzip2.rs) — streaming bzip2 decoder wrapping bzip2::MultiBzDecoder<CountedRead<R>>. Same delta-reporting interface as GzipStream.

CompressedStream trait (formats/mod.rs) — abstracts GzipStream and Bzip2Stream for generic scanning functions.

TarCursor (formats/tar.rs) — stateful tar header parser. Walks 512-byte header blocks, handles GNU longname (L) and PAX extended-header (x/g) records internally, yields TarEntryMeta. Zero allocation after startup.

TarRead trait (formats/tar.rs) — Read + optional take_compressed_delta() for compressed-byte accounting.

ZipCursor<R> (formats/zip.rs) — streaming cursor over a zip central directory. Parses EOCD, iterates CDFH entries, validates bounds. Supports Zip32 only; Zip64 sentinel values trigger UnsupportedFeature.

ZipSource trait (formats/zip.rs) — Read + Seek source with len() and try_clone(). Implemented for File, Cursor<Arc<[u8]>>, Cursor<Vec<u8>>.

ZipEntryReader (formats/zip.rs) — decompressed entry reader: Stored(CountedRead<LimitedRead>) or Deflate(DeflateDecoder<CountedRead<LimitedRead>>).

LimitedRead (formats/zip.rs) — bounds reads to a fixed byte count (compressed entry size).

CountedRead (util.rs) — Read wrapper that counts bytes consumed, driving inflation-ratio enforcement.
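
Both wrappers are thin adapters over `std::io::Read`. The sketch below reuses the type names for readability, but the fields, constructors, and the `take_delta` method are simplified assumptions rather than the definitions in util.rs and formats/zip.rs.

```rust
use std::io::{self, Read};

/// Counts bytes pulled from the inner reader and reports deltas on demand.
pub struct CountedRead<R> {
    inner: R,
    total: u64,
    reported: u64,
}

impl<R: Read> CountedRead<R> {
    pub fn new(inner: R) -> Self {
        Self { inner, total: 0, reported: 0 }
    }
    /// Bytes consumed since the previous call.
    pub fn take_delta(&mut self) -> u64 {
        let delta = self.total - self.reported;
        self.reported = self.total;
        delta
    }
}

impl<R: Read> Read for CountedRead<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.total += n as u64;
        Ok(n)
    }
}

/// Stops after `remaining` bytes, regardless of how much the source holds.
pub struct LimitedRead<R> {
    inner: R,
    remaining: u64,
}

impl<R: Read> LimitedRead<R> {
    pub fn new(inner: R, limit: u64) -> Self {
        Self { inner, remaining: limit }
    }
}

impl<R: Read> Read for LimitedRead<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let cap = (buf.len() as u64).min(self.remaining) as usize;
        if cap == 0 {
            return Ok(0);
        }
        let n = self.inner.read(&mut buf[..cap])?;
        self.remaining -= n as u64;
        Ok(n)
    }
}
```

Stacking them as `CountedRead<LimitedRead<R>>` gives a reader that cannot run past the entry's compressed size and that reports exactly how many compressed bytes each read consumed.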

Data Flow

How an Archive File Is Discovered, Opened, Iterated, and Scanned

1. Discovery
   ├─ Extension match (detect_kind_from_path)
   │  └─ Route directly to archive workers (bypass I/O threads)
   └─ First-chunk magic sniff (sniff_kind_from_header)
      └─ I/O thread routes to archive channel

2. Archive Open
   ├─ reset() → arm deadline, zero root counters
   ├─ enter_archive() → push frame, enforce depth cap
   └─ Format-specific init:
      ├─ gzip: GzipStream::new_with_header (parse FNAME)
      ├─ bzip2: Bzip2Stream::new
      ├─ tar: TarCursor::reset
      ├─ tar.gz: GzipStream wrapping → tar iteration
      ├─ tar.bz2: Bzip2Stream wrapping → tar iteration
      └─ zip: ZipCursor::open (EOCD → central directory)

3. Entry Iteration
   For each entry:
   ├─ Canonicalize name (EntryPathCanonicalizer)
   ├─ Build virtual path with locator suffix (@t/@z/@c)
   ├─ Check path budget
   ├─ Skip non-regular entries (dirs, symlinks)
   ├─ Check for nested archive (detect_kind_from_name_bytes)
   │  ├─ If nestable and depth allows → recurse
   │  └─ If zip-in-tar → unsupported (no random access)
   ├─ begin_entry() → open entry budget scope
   └─ Sliding-window read loop:
      ├─ Check deadline
      ├─ Copy overlap carry to buffer front
      ├─ Probe remaining budget allowance
      ├─ Read up to min(chunk_size, allowance, buf capacity)
      ├─ charge_compressed_in() + charge_decompressed_out()
      ├─ Deliver EntryChunk to sink (or scan_and_emit_chunk)
      └─ Update offset/carry; break if budget clamped

4. Entry Close
   ├─ on_entry_end() / end_entry(scanned)
   ├─ Drain unconsumed payload (tar alignment)
   └─ Record entry stats (scanned/skipped/partial)

5. Archive Close
   ├─ exit_archive() → pop frame
   └─ Return ArchiveEnd (Scanned/Skipped/Partial)

Sliding-Window Read Loop

Every entry payload uses the same read pattern:

stream_buf layout on each iteration:

  |<-- carry (overlap) -->|<--- new read (up to chunk_size) --->|
  ^                       ^
  buf[0]                  buf[carry]

carry = overlap.min(bytes_emitted_so_far)

Before each read, the last carry bytes of the previous chunk are copied to the buffer front, so downstream pattern matchers see a sliding window with overlap bytes of look-behind context. The remaining allowance is probed before the read to bound its size, and the bytes are charged after the read returns; if the charge is clamped, bytes beyond the budget are truncated and the loop exits.

The upper bound on a single read is ARCHIVE_STREAM_READ_MAX (256 KiB), keeping per-iteration work bounded even with large chunk_size.
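
The carry mechanics can be sketched in isolation. The function below is a hypothetical helper, not the loop in scan.rs: it leaves out budget charging, deadline polling, and the ARCHIVE_STREAM_READ_MAX bound, and hands each window to a closure along with the index where new bytes begin.

```rust
use std::io::{self, Read};

/// Reads `src` in overlapping windows. `emit(window, new_bytes_start)` is
/// called once per read; returns the total number of new bytes delivered.
pub fn scan_windows<R: Read>(
    mut src: R,
    buf: &mut [u8],
    chunk_size: usize,
    overlap: usize,
    mut emit: impl FnMut(&[u8], usize),
) -> io::Result<u64> {
    assert!(chunk_size > 0 && overlap + chunk_size <= buf.len());
    let mut emitted: u64 = 0; // new bytes delivered so far
    let mut prev_len = 0usize; // length of the previous window
    loop {
        // carry = overlap.min(bytes_emitted_so_far)
        let carry = (overlap as u64).min(emitted) as usize;
        // Move the tail of the previous window to the buffer front.
        buf.copy_within(prev_len - carry..prev_len, 0);
        let n = src.read(&mut buf[carry..carry + chunk_size])?;
        if n == 0 {
            return Ok(emitted);
        }
        emit(&buf[..carry + n], carry);
        emitted += n as u64;
        prev_len = carry + n;
    }
}
```

With a 4-byte chunk size and 2 bytes of overlap, the input `abcdefghij` is delivered as `abcd`, `cdefgh`, and `ghij`, so a pattern that straddles a chunk boundary is still seen whole.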

Format-Specific Details

Gzip (`scan_gzip_stream` / `process_gzip_file`)

Uses flate2::read::MultiGzDecoder to handle concatenated gzip members as one stream
Parses the optional gzip FNAME header field for the virtual entry name; falls back to <gunzip> when absent
Header parsing uses a bounded peek buffer (PeekRead) that is moved into the decoder and recovered afterward for reuse
Compressed-byte deltas tracked via CountedRead wrapping the raw reader
Inflation-ratio pre-clamping is always active (ratio_active = true)

Bzip2 (`scan_bzip2_stream` / `process_bzip2_file`)

Uses bzip2::read::MultiBzDecoder to handle concatenated bzip2 members
Virtual entry name is always <bunzip2> (bzip2 has no standard filename field)
Same CountedRead delta reporting as gzip
CPU exhaustion note: bzip2 block decompression can buffer up to 900 KiB internally per read() call. The deadline check fires between read iterations, not during a single decompression call, so a single block decode can run uninterrupted. Production deployments should set max_wall_clock_secs_per_root.

Tar (`scan_tar_stream` / `process_tar_file`)

Sequential 512-byte block parsing via TarCursor
Handles GNU longname (L) and PAX extended-header (x/g) records internally
PAX path= override applies per-file only (global PAX path is parsed but not applied to avoid misattribution)
Name resolution priority: PAX path > GNU longname > header name (with ustar prefix/name joining)
End-of-archive: two consecutive zero blocks or clean EOF at header boundary
Size fields parsed as NUL/space-padded ASCII octal; overflow (>21 digits) is rejected
Entries with tar typeflag 0 (NUL) or '0' (ASCII 0x30) are treated as regular files; everything else is skipped
is_zero_block uses word-wide (u64) unaligned reads with early exit for fast detection
After each entry's payload, any unconsumed bytes are drained and tar padding is consumed to maintain 512-byte alignment
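
Three of the tar details above are small enough to sketch directly: octal size parsing, zero-block detection, and padding. The function bodies are illustrative; in particular, treating an empty size field as malformed and copying each word through a temporary array (instead of an unaligned pointer read) are choices made for this sketch.

```rust
/// Parses a NUL/space-padded ASCII octal field. Returns `None` for an empty
/// field (assumption), a non-octal digit, or more than 21 digits.
pub fn parse_tar_size_octal(field: &[u8]) -> Option<u64> {
    let start = field.iter().position(|&b| b != b' ').unwrap_or(field.len());
    let end = field
        .iter()
        .rposition(|&b| b != 0 && b != b' ')
        .map_or(start, |i| i + 1);
    let digits = &field[start..end];
    if digits.is_empty() || digits.len() > 21 {
        return None;
    }
    let mut value: u64 = 0;
    for &d in digits {
        if !(b'0'..=b'7').contains(&d) {
            return None;
        }
        // 21 octal digits are at most 63 bits, so this cannot overflow.
        value = (value << 3) | u64::from(d - b'0');
    }
    Some(value)
}

/// Bytes of padding that follow a payload of `size` bytes.
pub fn padding_after(size: u64) -> u64 {
    (512 - size % 512) % 512
}

/// Word-wide zero check; `all` stops at the first non-zero word.
pub fn is_zero_block(block: &[u8; 512]) -> bool {
    block.chunks_exact(8).all(|w| {
        let mut word = [0u8; 8];
        word.copy_from_slice(w);
        u64::from_ne_bytes(word) == 0
    })
}
```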

Tar+Gzip (`scan_targz_stream` / `process_targz_file`)

Wraps the reader in GzipStream and delegates to scan_tar_stream with ratio_active = true
Inflation-ratio enforcement applies to the decompressed tar payload

Tar+Bzip2 (`scan_tarbz2_stream` / `process_tarbz2_file`)

Wraps the reader in Bzip2Stream and delegates to scan_tar_stream with ratio_active = true
Same inflation-ratio enforcement as tar+gzip

Zip (`scan_zip_source` / `process_zip_file`)

Requires random access (Read + Seek) via the ZipSource trait
Locates the end-of-central-directory (EOCD) record by scanning backward from the file end (up to 66 KiB window)
Validates: single-disk only, no Zip64 sentinel values (0xFFFF/0xFFFFFFFF)
Iterates central directory file headers (CDFH) sequentially
For each entry: reads the local file header (LFH) to locate the payload start
Supported compression methods: stored (method 0) and deflate (method 8)
Encrypted entries (flag bit 0) are handled per EncryptedPolicy
Compressed-byte deltas tracked manually (cumulative compressed bytes diffed between reads) because the zip reader does not expose per-read deltas like TarRead
Ratio pre-clamping active only for deflate entries; stored entries have 1:1 ratio
No recursive nesting: zip entries inside tar cannot be descended (no random access), handled per UnsupportedPolicy
Filename storage is bounded; oversized names are truncated with a streaming FNV-1a hash for the suffix
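
The backward EOCD search can be sketched over an in-memory tail of the file. `find_eocd` is a hypothetical helper: it assumes the caller has already read the last bytes of the file (bounded by the search window) into `tail`, and the comment-length consistency check is an extra validation chosen for this sketch, not something the text above specifies.

```rust
const EOCD_MIN_LEN: usize = 22;
const EOCD_SIG: [u8; 4] = *b"PK\x05\x06";

/// Returns the offset of the EOCD record within `tail`, scanning from the end.
pub fn find_eocd(tail: &[u8]) -> Option<usize> {
    if tail.len() < EOCD_MIN_LEN {
        return None;
    }
    let last_start = tail.len() - EOCD_MIN_LEN;
    (0..=last_start).rev().find(|&i| {
        tail[i..i + 4] == EOCD_SIG && {
            // The 2-byte little-endian comment length sits at offset 20.
            // Require it to account exactly for the bytes after the record.
            let comment_len = u16::from_le_bytes([tail[i + 20], tail[i + 21]]) as usize;
            i + EOCD_MIN_LEN + comment_len == tail.len()
        }
    })
}
```

Scanning backward matters because the signature bytes can also occur inside entry data or the archive comment; the record nearest the end that is consistent with the file length is the one to trust.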

Nested Archive Handling

Tar entries whose names match a known archive extension are recursively descended up to max_archive_depth. The recursion uses split_first_mut to peel per-depth scratch slices (vpaths, path_budget_used, tar_cursors) without allocation. Each nesting level gets its own independent state while sharing the budget tracker and stream buffer.

Supported nesting paths:

tar → gzip, bzip2, tar, tar.gz, tar.bz2 (sequential streams)
tar → zip: not supported (zip requires random access; handled by UnsupportedPolicy)
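
The `split_first_mut` pattern is shown below on a toy archive tree. `DepthSlot`, `Node`, and `scan` are hypothetical stand-ins: the real per-depth slices hold path builders and tar cursors, and the real recursion also consults the budget tracker.

```rust
/// Hypothetical per-depth scratch state.
#[derive(Default, Debug)]
pub struct DepthSlot {
    pub entries: u32,
}

/// Toy archive tree: a leaf file or a nested container.
pub enum Node {
    File,
    Archive(Vec<Node>),
}

/// Walks `nodes`, using `slots[0]` for this level and the rest for deeper
/// levels. Returns false if any nested archive was deeper than the slots allow.
pub fn scan(nodes: &[Node], slots: &mut [DepthSlot]) -> bool {
    // Peel off this level's slot; `deeper` is handed to the recursion.
    let (here, deeper) = match slots.split_first_mut() {
        Some(pair) => pair,
        None => return false,
    };
    let mut complete = true;
    for node in nodes {
        here.entries += 1;
        if let Node::Archive(children) = node {
            complete &= scan(children, deeper);
        }
    }
    complete
}
```

Because each level borrows a disjoint element of the preallocated slice, the borrow checker accepts the recursion without any per-level allocation or interior mutability.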

Virtual Paths

Virtual paths are display-only identifiers for files inside archives. They are not filesystem paths and are never used to open files.

Construction

The full virtual path is assembled as: <parent_display>::<canonicalized_entry_name><locator_suffix>

Example: /tmp/outer.tar::inner.zip::dir/file.txt@t000000000000002a

The :: separator is chosen to be visually distinct from filesystem separators.

Canonicalization (`EntryPathCanonicalizer`)

Normalize separators (\ → /) and split into components
Drop .; resolve .. via a stack, clamping at root (traversal sets had_traversal flag)
Emit escaped display bytes: non-printable bytes → %HH (uppercase hex)
Stream FNV-1a hash over full (unbounded) output while storing only up to max_len bytes
If truncated, replace tail with ~#<16-hex-digit> hash suffix (avoids splitting %HH escapes at the boundary)
Component count capped at DEFAULT_MAX_COMPONENTS (256)
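
The first three steps can be sketched as a standalone function. This `canonicalize` is a simplified illustration, not the EntryPathCanonicalizer method: it allocates, omits the length cap, hash, and component cap, and assumes the traversal flag is set only when `..` would climb above the root.

```rust
/// Returns the display bytes and whether a `..` tried to escape the root.
pub fn canonicalize(raw: &[u8]) -> (Vec<u8>, bool) {
    let mut stack: Vec<&[u8]> = Vec::new();
    let mut had_traversal = false;
    // Treat both separators alike, which normalizes `\` to `/`.
    for comp in raw.split(|&b| b == b'/' || b == b'\\') {
        if comp.is_empty() || comp == b"." {
            continue;
        }
        if comp == b".." {
            // Clamp at root: popping an empty stack is a traversal attempt.
            if stack.pop().is_none() {
                had_traversal = true;
            }
        } else {
            stack.push(comp);
        }
    }
    let mut out = Vec::new();
    for (i, comp) in stack.iter().enumerate() {
        if i > 0 {
            out.push(b'/');
        }
        for &b in *comp {
            if (0x20..0x7f).contains(&b) {
                out.push(b); // printable ASCII passes through
            } else {
                out.extend_from_slice(format!("%{b:02X}").as_bytes());
            }
        }
    }
    (out, had_traversal)
}
```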

Locator Suffixes

Each virtual path is suffixed with a fixed-length locator for downstream re-seeking:

Suffix	Format	Value
`@t<16hex>`	tar	Header block index
`@z<16hex>`	zip	Local file header offset (when valid)
`@c<16hex>`	zip	CDFH offset (fallback when LFH offset invalid)

Gzip entries omit the locator because gzip contains exactly one decompressed stream.
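
Formatting a locator needs no allocation, since the output length is fixed. The sketch below writes into an 18-byte array; `write_locator` is a hypothetical name, and lowercase hex is assumed from the example path and the write_u64_hex_lower helper.

```rust
pub const LOCATOR_LEN: usize = 18; // '@' + kind byte + 16 hex digits

/// Writes `@<kind><16 lowercase hex digits>` into `out`.
pub fn write_locator(kind: u8, value: u64, out: &mut [u8; LOCATOR_LEN]) {
    const HEX: &[u8; 16] = b"0123456789abcdef";
    out[0] = b'@';
    out[1] = kind;
    for i in 0..16 {
        // Most significant nibble first.
        let nibble = (value >> (60 - 4 * i)) & 0xf;
        out[2 + i] = HEX[nibble as usize];
    }
}
```

For a tar header block index of 42 this produces `@t000000000000002a`, the suffix used in the example path above.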

Path Budget

Per-archive path byte usage is tracked in path_budget_used to prevent unbounded growth from archives with many entries having long paths. Exceeding max_virtual_path_bytes_per_archive triggers PartialReason::PathBudgetExceeded.

Error Handling

Corrupt or Malformed Archives

Truncated headers: read_exact_or_eof returns UnexpectedEof with format-labeled messages (e.g., "tar truncated", "zip truncated")
Bad magic: format detection returns None; the archive is treated as a regular file
Malformed size fields: tar parse_tar_size_octal returns None → PartialReason::MalformedTar
ZIP EOCD not found: ZipOpen::Stop(MalformedZip)
ZIP Zip64 sentinels: ZipOpen::Skip(UnsupportedFeature) or ZipNext::Stop(UnsupportedFeature)
Mid-stream corruption: compressed stream read errors → PartialReason::CompressedStreamCorrupt (gzip/bzip2) or PartialReason::MalformedTar/MalformedZip

Partial outcomes retain results for bytes already scanned. Skipped outcomes discard nothing (no bytes were scanned).

I/O Errors

File open failures → ArchiveEnd::Skipped(IoError)
Read errors during header/payload → ArchiveEnd::Partial with the appropriate format reason
No retry logic: each error is treated as fatal for that archive

Policy Escalation

EncryptedPolicy and UnsupportedPolicy provide three levels:

SkipWithTelemetry — skip and record (default)
FailArchive — abort the current archive
FailRun — set abort_run flag, abort the entire scan

Outcome Recording

Every archive encounter records exactly one top-level outcome via ArchiveStats:

record_archive_scanned() — fully processed
record_archive_skipped(reason, path, sample) — rejected before payload
record_archive_partial(reason, path, sample) — stopped mid-scan

Entry-level outcomes:

record_entry_scanned() — at least one payload byte scanned
record_entry_skipped(reason, path, sample) — rejected before payload
record_entry_partial(reason, path, sample) — stopped mid-entry (budget/corruption)

The bounded ArchiveSampleRing (32 samples, 192-byte path prefix each) captures the first N skip/partial events for diagnostics.
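
A bounded, allocation-free sample store along these lines could look like the sketch below. The capacities match the constants quoted above; the `Sample` layout, the `u8` reason code, and the method names are assumptions, and the sketch keeps the first samples and drops later ones, following the "first N" wording.

```rust
pub const SAMPLE_MAX: usize = 32;
pub const PATH_PREFIX_MAX: usize = 192;

#[derive(Clone, Copy)]
pub struct Sample {
    pub reason: u8,
    pub path_len: u8, // 192 fits in a u8
    pub path: [u8; PATH_PREFIX_MAX],
}

pub struct SampleRing {
    samples: [Sample; SAMPLE_MAX],
    len: usize,
}

impl SampleRing {
    pub fn new() -> Self {
        let empty = Sample { reason: 0, path_len: 0, path: [0; PATH_PREFIX_MAX] };
        Self { samples: [empty; SAMPLE_MAX], len: 0 }
    }

    /// Stores a sample if there is room; returns false once the store is full.
    pub fn push(&mut self, reason: u8, path: &[u8]) -> bool {
        if self.len == SAMPLE_MAX {
            return false;
        }
        let n = path.len().min(PATH_PREFIX_MAX);
        let slot = &mut self.samples[self.len];
        slot.reason = reason;
        slot.path_len = n as u8;
        slot.path[..n].copy_from_slice(&path[..n]);
        self.len += 1;
        true
    }

    pub fn len(&self) -> usize {
        self.len
    }

    pub fn get(&self, i: usize) -> Option<(u8, &[u8])> {
        self.samples[..self.len]
            .get(i)
            .map(|s| (s.reason, &s.path[..s.path_len as usize]))
    }
}
```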

Constants & Tuning

Archive Configuration Defaults

Parameter	Default	Rationale
`max_archive_depth`	3	Covers `.tar.gz` containing a `.zip`; deeper nesting is adversarial
`max_entries_per_archive`	4096	Generous for real archives, bounds CPU in entry-counting loops
`max_uncompressed_bytes_per_entry`	64 MiB	Limits peak memory per entry
`max_total_uncompressed_bytes_per_archive`	256 MiB	Limits total archive output
`max_total_uncompressed_bytes_per_root`	512 MiB	Limits cross-archive output under 1 GiB
`max_archive_metadata_bytes`	16 MiB	Bounds header/CD parsing
`max_inflation_ratio`	128x	Accommodates high-compression formats; catches classic zip bombs
`max_virtual_path_len_per_entry`	1024 bytes	Bounds display path storage
`max_virtual_path_bytes_per_archive`	1 MiB	Bounds total path arena per archive
`max_wall_clock_secs_per_root`	`None`	Keeps defaults deterministic; production should opt in (e.g., 30s)
`DEFAULT_WALL_CLOCK_SECS_PER_ROOT`	30s	Suggested production value
`MAX_WALL_CLOCK_SECS_PER_ROOT`	86,400s	Upper bound to prevent `Instant` overflow

Internal Constants

Constant	Value	Location	Purpose
`ARCHIVE_STREAM_READ_MAX`	256 KiB	`scan.rs`	Upper bound on single decompressed read
`LOCATOR_LEN`	18 bytes	`scan.rs`	`@` + kind + 16 hex digits
`TAR_BLOCK_LEN`	512 bytes	`formats/tar.rs`	Tar header/data block size
`USTAR_MAGIC_OFFSET`	257	`formats/tar.rs`	Offset of `"ustar"` magic in tar header
`EOCD_MIN_LEN`	22 bytes	`formats/zip.rs`	Minimum end-of-central-directory size
`EOCD_SEARCH_MAX`	66 KiB	`formats/zip.rs`	Backward search window for EOCD
`CDFH_LEN`	46 bytes	`formats/zip.rs`	Central directory fixed header length
`LFH_LEN`	30 bytes	`formats/zip.rs`	Local file header fixed length
`DEFAULT_MAX_COMPONENTS`	256	`path.rs`	Max path components during canonicalization
`TRUNC_SUFFIX_LEN`	18 bytes	`path.rs`	`~#` + 16 hex digits
`ARCHIVE_SAMPLE_MAX`	32	`outcome.rs`	Max samples in bounded ring
`ARCHIVE_SAMPLE_PATH_PREFIX_MAX`	192 bytes	`outcome.rs`	Max path prefix per sample
`ENTRY_NOT_OPEN`	`u64::MAX`	`budget.rs`	Sentinel for entry-not-open state
`VIRTUAL_FILE_ID_BASE`	`0x8000_0000`	`local_fs_archive_ctx.rs`	High-bit namespace for virtual IDs

Frame Stack Sizing

ArchiveFrame is exactly 48 bytes (2 × u32 + 5 × u64, #[repr(C)], compile-time asserted). The frame stack is preallocated to max_archive_depth elements at construction and never grows.
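
The size claim can be checked at compile time. In the sketch below the field names follow the list given for ArchiveFrame, while the field order, the choice of `entry_decompressed_out` as the sentinel-bearing field, and the method names are assumptions.

```rust
use std::mem::size_of;

pub const ENTRY_NOT_OPEN: u64 = u64::MAX;

#[repr(C)]
#[derive(Debug, Clone, Copy)]
pub struct Frame {
    pub entries_seen: u32,
    pub entries_scanned: u32,
    pub metadata_bytes: u64,
    pub compressed_in: u64,
    pub decompressed_out: u64,
    pub entry_compressed_in: u64,
    pub entry_decompressed_out: u64,
}

// 2 x u32 + 5 x u64 = 48 bytes; the build fails if the layout ever changes.
const _: () = assert!(size_of::<Frame>() == 48);

impl Frame {
    pub const fn new() -> Self {
        Self {
            entries_seen: 0,
            entries_scanned: 0,
            metadata_bytes: 0,
            compressed_in: 0,
            decompressed_out: 0,
            entry_compressed_in: 0,
            entry_decompressed_out: ENTRY_NOT_OPEN,
        }
    }

    /// The sentinel replaces a separate `bool`, which would add padding.
    pub fn entry_open(&self) -> bool {
        self.entry_decompressed_out != ENTRY_NOT_OPEN
    }

    pub fn begin_entry(&mut self) {
        self.entry_compressed_in = 0;
        self.entry_decompressed_out = 0;
    }

    pub fn end_entry(&mut self) {
        self.entry_decompressed_out = ENTRY_NOT_OPEN;
    }
}
```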

Source of Truth

File	Purpose
`archive/mod.rs`	Module root, re-exports
`archive/config.rs`	`ArchiveConfig` struct + defaults + validation
`archive/budget.rs`	`ArchiveBudgets` struct
`archive/budget.rs`	`BudgetHit` enum
`archive/budget.rs`	`ChargeResult` enum
`archive/budget.rs`	`ArchiveFrame` struct
`archive/budget.rs`	`enter_archive()`
`archive/budget.rs`	`charge_decompressed_out()` — five-cap minimum logic
`archive/budget.rs`	`charge_discarded_out()` — bypasses per-entry output cap
`archive/detect.rs`	`ArchiveKind` enum
`archive/detect.rs`	`detect_kind_from_path()`
`archive/detect.rs`	`sniff_kind_from_header()`
`archive/detect.rs`	`detect_kind()` — combined detection
`archive/detect.rs`	`detect_kind_from_name_bytes()` — byte-level suffix matcher
`archive/outcome.rs`	`ArchiveSkipReason` (14 variants)
`archive/outcome.rs`	`EntrySkipReason` (10 variants)
`archive/outcome.rs`	`PartialReason` (12 variants)
`archive/outcome.rs`	`ArchiveStats` struct
`archive/outcome.rs`	`ArchiveSampleRing` struct
`archive/path.rs`	`EntryPathCanonicalizer` struct
`archive/path.rs`	`canonicalize()` method
`archive/path.rs`	`VirtualPathBuilder` struct
`archive/path.rs`	`build()` method
`archive/path.rs`	`apply_hash_suffix_truncation()`
`archive/scan.rs`	`ArchiveEnd` enum
`archive/scan.rs`	`EntryMeta` struct
`archive/scan.rs`	`EntryChunk` struct
`archive/scan.rs`	`ArchiveEntrySink` trait
`archive/scan.rs`	`ArchiveScratch<Z>` struct
`archive/scan.rs`	`ArchiveScanCtx` struct
`archive/scan.rs`	`scan_gzip_stream()`
`archive/scan.rs`	`scan_compressed_entry_stream()` — shared inner loop
`archive/scan.rs`	`scan_bzip2_stream()`
`archive/scan.rs`	`scan_tar_stream()`
`archive/scan.rs`	`scan_tar_stream_nested()` — recursive tar iteration
`archive/scan.rs`	`scan_targz_stream()`
`archive/scan.rs`	`scan_tarbz2_stream()`
`archive/scan.rs`	`scan_zip_source()`
`archive/util.rs`	`CountedRead` struct
`archive/util.rs`	FNV-1a hash functions
`archive/util.rs`	`write_u64_hex_lower()`
`archive/util.rs`	`read_exact_n()`
`archive/util.rs`	`budget_hit_to_partial()`
`archive/formats/mod.rs`	`CompressedStream` trait
`archive/formats/gzip.rs`	`GzipStream` struct
`archive/formats/gzip.rs`	`new_with_header()` — header parsing
`archive/formats/bzip2.rs`	`Bzip2Stream` struct
`archive/formats/tar.rs`	`TarRead` trait
`archive/formats/tar.rs`	`TarEntryMeta` struct
`archive/formats/tar.rs`	`TarNext` enum
`archive/formats/tar.rs`	`TarCursor` struct
`archive/formats/tar.rs`	`next_entry()` — header parsing loop
`archive/formats/zip.rs`	`ZipSource` trait
`archive/formats/zip.rs`	`ZipEntryMeta` struct
`archive/formats/zip.rs`	`ZipCursor` struct
`archive/formats/zip.rs`	`open()` — EOCD parsing
`archive/formats/zip.rs`	`next_entry()` — CDFH iteration
`archive/formats/zip.rs`	`open_entry_reader()` — LFH validation + reader construction
`archive/formats/zip.rs`	`ZipEntryReader` enum
`archive/formats/zip.rs`	`LimitedRead` struct
`scheduler/local_fs_archive_ctx.rs`	`alloc_virtual_file_id()`
`scheduler/local_fs_archive_ctx.rs`	`dispatch_archive_scan()`
`scheduler/local_fs_archive_ctx.rs`	`ArchiveEnd` (scheduler variant)
`scheduler/local_fs_archive_ctx.rs`	`ArchiveScanCtx` (blocking-path variant)
`scheduler/local_fs_archive_ctx.rs`	`scan_and_emit_chunk()`
`scheduler/local_fs_archive_ctx.rs`	`apply_entry_budget_clamp()`
`scheduler/local_fs_archive_ctx.rs`	`discard_remaining_payload()`
`scheduler/local_fs_archive_ctx.rs`	`scan_compressed_stream_nested()`
`scheduler/local_fs_gzip.rs`	`process_gzip_file()`
`scheduler/local_fs_bzip2.rs`	`process_bzip2_file()`
`scheduler/local_fs_tar.rs`	`process_tar_file()`, `process_targz_file()`, `process_tarbz2_file()`, recursive tar iteration
`scheduler/local_fs_zip.rs`	`process_zip_file()`
`scheduler/local_fs_extract.rs`	`extract_and_scan_file()` — binary extraction (non-archive)
