The active filesystem scan pipeline is orchestrated in crates/scanner-scheduler/src/scheduler/parallel_scan.rs
and executed by the scheduler in crates/scanner-scheduler/src/scheduler/. The current filesystem path does not
use the file_ring/chunk_ring/out_ring stage queues.
```mermaid
flowchart LR
    Path["Path / root"] --> Orch["parallel_scan_dir()<br/>(entry + setup)"]
    Orch --> PScan["scan_local()<br/>(discovery + dispatch)"]
    PScan --> Walker["IterWalker::next_file()"]
    Walker --> Budget["CountBudget<br/>(max_in_flight_objects)"]
    Budget --> Exec["Executor<FileTask>"]
    Exec --> Worker["process_file()"]
    Worker --> Detect["Archive detect<br/>(extension -> header sniff)"]
    Detect -->|archive| Arch["dispatch_archive_scan()"]
    Detect -->|regular file| ChunkLoop["Sequential read + overlap carry"]
    ChunkLoop --> Engine["Engine::scan_chunk_into()"]
    Engine --> Emit["CoreEvent::Finding via EventOutput"]
    Engine --> Persist["emit_persistence_batch()<br/>via StoreProducer"]
    Pool["TsBufferPool"] -.->|"acquire()"| ChunkLoop
    ChunkLoop -.->|"TsBufferHandle::drop()"| Pool
    Emit --> Summary["Summary event + sink.flush()"]
    Persist -.->|"run end"| RunLoss["record_fs_run_loss()"]
```
- Entry point: `parallel_scan_dir(...)`
  - Validates the root, builds `IterWalker`, then calls `scan_local(...)`
  - Emits the final `CoreEvent::Summary` and flushes the sink
- `IterWalker` performs single-threaded filesystem discovery
  - Produces `LocalFile` values
  - Respects walker config (`follow_symlinks`, hidden files, gitignore)
- `scan_local(...)` enqueues `FileTask` values and runs `Executor<FileTask>`
  - `CountBudget` enforces discovery backpressure (`max_in_flight_objects`)
- Workers run `process_file(...)`:
  - Archive detection by extension, then header sniff when enabled
  - Binary skip/extract gate (content-policy based)
  - Sequential read with overlap carry (`copy_within` tail -> head)
  - `Engine::scan_chunk_into(...)` + `drop_prefix_findings(...)`
  - Optional within-chunk dedupe (includes `norm_hash` in the dedup key) + `CoreEvent::Finding` emission
  - `build_persistence_batch()` + `emit_persistence_batch()` via `StoreProducer` (when configured)
- Findings are emitted directly through `EventOutput` (JSONL/Text/JSON/SARIF)
  - No filesystem-path `OutputStage` queue in the scheduler flow
- When `--persist-findings` is set, post-dedupe findings are also emitted to a `StoreProducer` via `emit_persistence_batch()` at every scan site
  - Each batch carries the scanned object's path and its `FsFindingRecord` values
  - At run end, `record_fs_run_loss()` emits loss accounting (`FsRunLoss`) so the backend can decide whether the run is complete or partial
  - Errors from the producer are counted (`persistence_emit_failures`) but do not abort the scan (fail-soft semantics)
  - See fs-persistence-pipeline.md for full details
`PipelineStats` in crates/scanner-scheduler/src/pipeline.rs includes `archive: ArchiveStats`, while
the active scheduler report type is `LocalReport`/`MetricsSnapshot`.
```mermaid
sequenceDiagram
    participant Pool as TsBufferPool
    participant Worker as process_file()
    participant File as std::fs::File
    participant Engine as Engine
    participant Sink as EventOutput
    participant Store as StoreProducer
    Worker->>Pool: acquire()
    Pool-->>Worker: TsBufferHandle
    loop until EOF or snapshot boundary
        Worker->>File: read(payload after overlap carry)
        Worker->>Engine: scan_chunk_into(data, file_id, base_offset, scratch)
        Worker->>Worker: drop_prefix_findings + optional dedupe
        Worker->>Store: emit_persistence_batch(path, findings)
        Worker->>Sink: emit CoreEvent::Finding
    end
    Worker->>Pool: TsBufferHandle::drop()
```
Active filesystem defaults (high-level API):
| Setting | Default | Source |
|---|---|---|
| `ParallelScanConfig.chunk_size` | 256 KiB | crates/scanner-scheduler/src/scheduler/parallel_scan.rs |
| `ParallelScanConfig.pool_buffers` | `workers * 4` | crates/scanner-scheduler/src/scheduler/parallel_scan.rs |
| `ParallelScanConfig.max_in_flight_objects` | 1024 | crates/scanner-scheduler/src/scheduler/parallel_scan.rs |
| `ParallelScanConfig.local_queue_cap` | 4 | crates/scanner-scheduler/src/scheduler/parallel_scan.rs |
The `scan_local` memory bound is approximately:

```
peak_buffer_bytes ~= pool_buffers * (chunk_size + engine.required_overlap())
```
For direct library usage, crates/scanner-scheduler/src/runtime.rs provides a single-threaded
`ScannerRuntime` + `read_file_chunks(...)` path with a `BufferPool`.
- Backpressure is explicit via `CountBudget` and the fixed-capacity `TsBufferPool`
- Discovery is single-threaded and bounded; scanning is parallel owner-compute
- Buffer ownership is RAII (`TsBufferHandle`), so release is deterministic on drop
Git scanning is staged and resource-bounded, but not strictly single-threaded:
- Parallelism knobs:
  - `GitScanConfig.pack_exec_workers` (pack decode/scan workers)
  - `GitScanConfig.blob_intro_workers` (parallel blob introduction)
- Deterministic output ordering is preserved by ordered merge/reassembly in the runner
- Key bounded points:
  - `SpillLimits` (spill bytes, chunk candidates, run caps)
  - `MappingBridgeConfig` (`path_arena_capacity`, candidate caps)
  - `PackPlanConfig` (`max_worklist_entries`, `max_delta_depth`)
  - `PackMmapLimits` (`max_open_packs`, `max_total_bytes`)
  - `PackDecodeLimits` (header/delta/object byte limits)
When limits are hit, runs can fail or become partial; watermark writes are only advanced on complete finalize outcomes.