Engine Transforms Module

Transform execution and streaming decode for URL percent-encoding and Base64 payloads.

Located in: crates/scanner-engine/src/engine/transform.rs

Module Overview

The transform module implements the transform stage of the scanner pipeline, responsible for:

  1. Span Detection - Finding candidate URL-encoded and Base64-encoded spans in input buffers
  2. Streaming Decode - Efficiently decoding spans with bounded memory and single-pass processing
  3. Output Emission - Emitting decoded bytes via callbacks with internal buffering

The module trades precision for throughput: span finders use permissive scanning (favoring recall over strict validation), while decode operators enforce correctness through bounded-memory streaming decoders.

Transform Pipeline

Transform execution follows a two-stage model:

flowchart LR
    Input["Input Buffer<br/>(bytes)"]

    subgraph SpanFinding["Stage 1: Span Detection"]
        direction TB
        Prefilter["Quick Prefilter<br/>(transform_quick_trigger)"]
        URLSpan["URL Span Finder<br/>(find_url_spans_into)"]
        B64Span["Base64 Span Finder<br/>(find_base64_spans_into)"]
        SpanList["Span List<br/>(byte ranges)"]
    end

    subgraph Decoding["Stage 2: Streaming Decode"]
        direction TB
        Span["Span Range<br/>[start, end)"]
        Extract["Extract Bytes<br/>input[start..end]"]
        StreamDecode["Stream Decoder<br/>(stream_decode)"]
        Callbacks["Emit Bytes<br/>(on_bytes callbacks)"]
    end

    Output["Output<br/>(decoded bytes)"]

    Input --> Prefilter
    Prefilter --> URLSpan
    Prefilter --> B64Span
    URLSpan --> SpanList
    B64Span --> SpanList

    SpanList --> Span
    Span --> Extract
    Extract --> StreamDecode
    StreamDecode --> Callbacks
    Callbacks --> Output

    style Input fill:#e3f2fd
    style SpanFinding fill:#fff3e0
    style Decoding fill:#e8f5e9
    style Output fill:#e8eaf6

Stage 1: Span Detection

Span finders execute single-pass scans to identify candidate encoding spans:

  • URL Percent: Scans URL-ish runs (RFC3986 unreserved/reserved + '%' and '+'), keeps runs containing at least one trigger byte (%, and + when plus_to_space is enabled)
  • Base64: Scans base64 alphabet runs (including allowed whitespace), trims trailing whitespace

Both finders:

  • Are capped by max_len boundaries (splits long runs)
  • Stop after max_spans spans are found
  • Use fast byte-class lookup tables for O(1) per-byte classification
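The scan described above can be sketched as a single-pass scanner. This is a simplified illustration: it reports whole qualifying runs and omits the max_len splitting and max_spans capping of the real find_url_spans_into; the function and helper names are hypothetical.

```rust
// Simplified byte classification: RFC3986 unreserved/reserved plus '%' and '+'.
fn is_urlish(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b"-._~:/?#[]@!$&'()*,;=%+".contains(&b)
}

// Single-pass span finder: emit runs of URL-ish bytes that contain at least
// one trigger ('%' always; '+' only when plus_to_space is enabled).
fn find_url_spans(buf: &[u8], min_len: usize, plus_to_space: bool) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let (mut start, mut triggers, mut in_run) = (0usize, 0usize, false);
    for (i, &b) in buf.iter().enumerate() {
        if is_urlish(b) {
            if !in_run {
                in_run = true;
                start = i;
                triggers = 0;
            }
            if b == b'%' || (plus_to_space && b == b'+') {
                triggers += 1;
            }
        } else if in_run {
            if triggers > 0 && i - start >= min_len {
                spans.push((start, i));
            }
            in_run = false;
        }
    }
    // Flush a run that extends to the end of the buffer.
    if in_run && triggers > 0 && buf.len() - start >= min_len {
        spans.push((start, buf.len()));
    }
    spans
}
```

A run with no trigger bytes is discarded even when it is long enough, which is what keeps plain text from producing candidate spans.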

Stage 2: Streaming Decode

Streaming decoders process each span:

  • Single-pass, O(1) memory: No intermediate buffers; output emitted in chunks
  • Bounded chunks: 16 KiB output buffer (STREAM_DECODE_CHUNK_BYTES)
  • Callback-based: Emit decoded bytes via on_bytes callbacks; caller can abort early via ControlFlow::Break()
  • Error handling: URL decode is infallible; a Base64 decode error stops the stream, and bytes emitted before the error are not rolled back

Span Detection Details

URL Percent-Encoding Spans

Span Finder: find_url_spans_into() and UrlSpanStream

The URL span finder identifies runs of "URL-ish" characters that contain at least one encoding trigger:

// A span qualifies if:
// 1. Run is entirely URL-ish (RFC3986 unreserved/reserved + '%' + '+')
// 2. Contains at least one trigger byte
// 3. Trigger is '%' always; '+' is also a trigger when plus_to_space is enabled
// 4. Run length >= min_len and <= max_len

Byte Classification

URL-ish bytes include:

  • Alphanumerics: [A-Za-z0-9]
  • RFC3986 unreserved: -._~
  • RFC3986 reserved: :/?#[]@!$&'()*,;=
  • Encoding markers: % and +

Chunked Processing: UrlSpanStream

For streaming input, UrlSpanStream maintains state across chunk boundaries:

pub(super) struct UrlSpanStream {
    min_len: usize,              // Minimum span size
    max_len: usize,              // Maximum span size (splits long runs)
    plus_to_space: bool,         // '+' counts as trigger
    in_run: bool,                // Currently in a URL-ish run
    start: u64,                  // Absolute start offset
    run_len: usize,              // Current run length
    triggers: usize,             // Count of '%' (and '+' if enabled)
    done: bool,                  // Stream stopped early
}

Example: Span Finding

Input: "query=%48%65%6C%6C%6F next"
        ^^^^^^^^^^^^^^^^^^^^^ URL-ish run with 5 '%' triggers

Output spans:
  [6, 21)  → "%48%65%6C%6C%6F" (triggers > 0, len >= min_len)

Note: `'+'` is only a trigger when `plus_to_space=true`.

Base64-Encoding Spans

Span Finder: find_base64_spans_into() and Base64SpanStream

The Base64 span finder identifies runs of base64 alphabet with optional allowed whitespace:

// A span qualifies if:
// 1. Run is base64 alphabet (A-Za-z0-9+/-_=) + allowed whitespace
// 2. Contains at least min_chars non-padding base64 chars
// 3. Span trimmed to last base64 byte (trailing whitespace removed)
// 4. Total run length <= max_len

Byte Classification

  • Base64 alphabet: [A-Za-z0-9+/-_=] (standard + URL-safe variants)
  • Allowed whitespace: \t, \r, \n, optionally space (if base64_allow_space_ws)
  • Whitespace not counted toward min_chars (but it does count toward max_len)

Chunked Processing: Base64SpanStream

pub(super) struct Base64SpanStream {
    min_chars: usize,            // Min non-padding base64 chars (excludes whitespace and '=')
    max_len: usize,              // Max run length including whitespace
    allow_space_ws: bool,        // Include space in allowed whitespace
    in_run: bool,                // Currently in base64 run
    pad_seen: bool,              // Seen '=' in current run (padding-tail mode)
    start: u64,                  // Absolute start offset
    run_len: usize,              // Total run length
    b64_chars: usize,            // Count of non-padding base64 chars
    have_b64: bool,              // Seen at least one non-padding base64 char
    last_b64: u64,               // Offset of last base64 byte (including '=')
    done: bool,                  // Stream stopped early
}

Example: Span Finding

Input: "data:SGVsbG8gV29y bGQ=extra"
             ^^^^^^^^^^^^^^^^^ base64 run (includes space)

Alphabet count: 16 base64 bytes (15 non-padding + '='); the space is not counted
Output span: [5, 22)  → "SGVsbG8gV29y bGQ=" (trimmed to last base64 byte)

Decoding Logic

URL Percent-Decoding

Decoder: stream_decode_url_percent()

Decodes %HH escape sequences and optionally converts + to space:

// Rules:
// 1. "%HH" (where H is hex) → single byte (HH)
// 2. "%" not followed by 2 hex digits → passed through unchanged
// 3. "+" → space (if plus_to_space enabled)
// 4. All other bytes → passed through unchanged

Decode Process

Input:    "hello%20world%2B%XX+"
Process:
  "h"     → "h"          (literal)
  "e"     → "e"          (literal)
  "l"     → "l"          (literal)
  "l"     → "l"          (literal)
  "o"     → "o"          (literal)
  "%20"   → " "          (space, hex 20)
  "w"     → "w"          (literal)
  ...
  "%2B"   → "+"          (plus, hex 2B)
  "%XX"   → "%XX"        (invalid hex, passed through)
  "+"     → " "          (if plus_to_space enabled)

Output: "hello world+%XX " (assuming plus_to_space=true)
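The trace above corresponds to this buffered sketch of the decode rules (an illustration: the real stream_decode_url_percent emits 16 KiB chunks through a callback rather than returning a Vec, and the function name here is hypothetical):

```rust
// Percent-decode rules: "%HH" → byte, malformed "%" passes through,
// '+' → space only when plus_to_space is enabled.
fn decode_url_percent(input: &[u8], plus_to_space: bool) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        let b = input[i];
        if b == b'%'
            && i + 2 < input.len()
            && input[i + 1].is_ascii_hexdigit()
            && input[i + 2].is_ascii_hexdigit()
        {
            // Valid "%HH": emit the decoded byte and consume 3 input bytes.
            let hi = (input[i + 1] as char).to_digit(16).unwrap() as u8;
            let lo = (input[i + 2] as char).to_digit(16).unwrap() as u8;
            out.push((hi << 4) | lo);
            i += 3;
        } else if b == b'+' && plus_to_space {
            out.push(b' ');
            i += 1;
        } else {
            // Invalid escape or ordinary byte: pass through unchanged.
            out.push(b);
            i += 1;
        }
    }
    out
}
```

Note how a trailing "%" cut off by span splitting simply passes through as a literal, matching the run-splitting behavior described later in this document.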

Hex Parsing

fn is_hex(b: u8) -> bool {
    b.is_ascii_hexdigit()  // [0-9a-fA-F]
}

fn hex_val(b: u8) -> u8 {
    match b {
        b'0'..=b'9' => b - b'0',
        b'a'..=b'f' => b - b'a' + 10,
        _ => b - b'A' + 10,  // 'A'..'F'; caller guarantees is_hex(b)
    }
}

Memory Bounds

  • Input: N bytes
  • Output: ≤ N bytes (each 3-byte "%HH" escape shrinks to 1 byte)
  • Streaming buffer: 16 KiB chunks

Base64 Decoding

Decoder: stream_decode_base64()

Decodes standard (+/) and URL-safe (-_) base64 alphabets:

// Processing:
// 1. Skip whitespace (space, tab, CR, LF)
// 2. Collect 4-byte quantum
// 3. Validate padding rules
// 4. Emit 1-3 decoded bytes per quantum
// 5. Accept unpadded tail (2-3 bytes)

Alphabet Mapping

  • A-Z → 0-25
  • a-z → 26-51
  • 0-9 → 52-61
  • + or - → 62
  • / or _ → 63
  • = → padding marker
  • Other bytes → invalid

Decoding State Machine

Input bytes: a b c d (one quantum)

Padding patterns:
  [a b c d]        → 3 output bytes: (a << 2) | (b >> 4)
                                     ((b & 0x0F) << 4) | (c >> 2)
                                     ((c & 0x03) << 6) | d

  [a b c =]        → 2 output bytes: (a << 2) | (b >> 4)
                                     ((b & 0x0F) << 4) | (c >> 2)

  [a b = =]        → 1 output byte:  (a << 2) | (b >> 4)

  [a = = =]        → ERROR (insufficient data)
  [a b = c]        → ERROR (padding in middle)
  [a = c d]        → ERROR (non-padding after padding)

Unpadded Tail Handling

  • 2 characters → 1 decoded byte (sufficient data)
  • 3 characters → 2 decoded bytes (sufficient data)
  • 1 character → ERROR (TruncatedQuantum)

Example Decode

Input:  "SGVs bG8="  (includes space, standard padding)
         ^9 chars: 8 base64 bytes (incl. '=') + 1 space

Process:
  S   → 18  (A-Z → 0-25)
  G   → 6
  V   → 21
  s   → 44  (a-z → 26-51)
  (space ignored)
  b   → 27
  G   → 6
  8   → 60  (0-9 → 52-61)
  =   → PAD

Quantum [18, 6, 21, 44]:
  (18 << 2) | (6 >> 4)           = 72 | 0 = 72   ('H')
  ((6 & 0xF) << 4) | (21 >> 2)   = 96 | 5 = 101  ('e')
  ((21 & 0x3) << 6) | 44         = 64 | 44 = 108 ('l')
Quantum [27, 6, 60, PAD]:
  (27 << 2) | (6 >> 4)           = 108 | 0 = 108 ('l')
  ((6 & 0xF) << 4) | (60 >> 2)   = 96 | 15 = 111 ('o')

Output: "Hello"
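The quantum arithmetic above can be checked with a compact standalone decoder. This is a sketch with simplified padding validation, not the engine's strict state machine; b64_val and decode_base64 are hypothetical names.

```rust
// Map a byte to its 6-bit value (standard and URL-safe alphabets).
fn b64_val(b: u8) -> Option<u8> {
    match b {
        b'A'..=b'Z' => Some(b - b'A'),
        b'a'..=b'z' => Some(b - b'a' + 26),
        b'0'..=b'9' => Some(b - b'0' + 52),
        b'+' | b'-' => Some(62),
        b'/' | b'_' => Some(63),
        _ => None,
    }
}

fn decode_base64(input: &[u8]) -> Result<Vec<u8>, &'static str> {
    let mut out = Vec::new();
    let mut q = [0u8; 4]; // current quantum of 6-bit values
    let mut n = 0;
    let mut pads = 0;
    for &b in input {
        if matches!(b, b' ' | b'\t' | b'\r' | b'\n') {
            continue; // skip whitespace
        }
        if b == b'=' {
            pads += 1;
            continue;
        }
        if pads > 0 {
            return Err("non-padding after padding");
        }
        q[n] = b64_val(b).ok_or("invalid byte")?;
        n += 1;
        if n == 4 {
            // Full quantum: emit 3 bytes using the formulas above.
            out.push((q[0] << 2) | (q[1] >> 4));
            out.push(((q[1] & 0x0F) << 4) | (q[2] >> 2));
            out.push(((q[2] & 0x03) << 6) | q[3]);
            n = 0;
        }
    }
    // Tail handling: 2 or 3 leftover chars decode; 1 is an error.
    match n {
        0 => Ok(out),
        1 => Err("truncated quantum"),
        2 => {
            out.push((q[0] << 2) | (q[1] >> 4));
            Ok(out)
        }
        _ => {
            out.push((q[0] << 2) | (q[1] >> 4));
            out.push(((q[1] & 0x0F) << 4) | (q[2] >> 2));
            Ok(out)
        }
    }
}
```

The same tail branch handles both '='-padded input (padding just marks where the quantum stops) and the unpadded 2- or 3-character tails the engine accepts.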

Budget Enforcement

The module enforces runtime limits to prevent pathological input from causing unbounded work:

Span Finder Budgets

Per-scan limits (configured via TransformConfig):

Budget                      Field                   Default         Purpose
Minimum span size           min_len                 Configurable    Skip trivial spans
Maximum span size           max_encoded_len         Configurable    Cap scan depth; splits long runs
Maximum spans per buffer    max_spans_per_buffer    Configurable    Stop after N spans found

Run Splitting

Long runs are split at max_len boundaries to bound worst-case O(N) scans:

// Splitting does NOT align to encoding quanta
// - URL: %HH may be split into "%" and "HH"
// - Base64: 4-char quantum may be split

// Decoder treats each span independently:
// - URL decoder: split "%" parts become literals
// - Base64 decoder: fragment may fail strict validation

Decoder Budgets

Streaming decoders use O(1) memory:

  • Output buffer: 16 KiB (STREAM_DECODE_CHUNK_BYTES)
  • No intermediate allocations
  • Chunks emitted via callback

The module does not enforce per-span output caps itself. Engine callers enforce TransformConfig::max_decoded_bytes (for example via DecodeSlab::append_stream_decode).

Caller responsibility:

// Callers can enforce output-size limits in the on_bytes callback:
let mut total_out = 0usize;
let on_bytes = |chunk: &[u8]| {
    total_out += chunk.len();
    if total_out > MAX_OUTPUT {
        return ControlFlow::Break(());  // Stop early
    }
    ControlFlow::Continue(())
};

Work Queue Processing

Quick Prefilter

Before expensive span scanning, a quick prefilter checks for trigger bytes:

pub(super) fn transform_quick_trigger(tc: &TransformConfig, buf: &[u8]) -> bool {
    match tc.id {
        TransformId::UrlPercent => {
            // URL spans require at least one '%'
            if memchr(b'%', buf).is_some() {
                return true;
            }
            // Or '+' if plus_to_space enabled
            if tc.plus_to_space && memchr(b'+', buf).is_some() {
                return true;
            }
            false
        }
        TransformId::Base64 => true,  // Span finder is the real filter
    }
}

Benefit: Avoids expensive full-buffer scan for non-matching buffers (e.g., binary data with no '%').
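For illustration, a dependency-free equivalent of the UrlPercent branch (the real code uses the memchr crate for the byte search; url_quick_trigger is a hypothetical name):

```rust
// Quick prefilter for URL spans: any '%' triggers a full scan, and '+'
// also triggers one when plus_to_space is enabled.
fn url_quick_trigger(buf: &[u8], plus_to_space: bool) -> bool {
    buf.iter()
        .any(|&b| b == b'%' || (plus_to_space && b == b'+'))
}
```

Unlike this linear scan, memchr uses SIMD-accelerated search, which is what makes the prefilter cheap enough to run on every buffer.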

Span Sink Abstraction

Spans are collected into a flexible SpanSink trait:

pub(super) trait SpanSink {
    fn clear(&mut self);
    fn len(&self) -> usize;
    fn push(&mut self, span: Range<usize>);
}

Implementations:

  • Vec<Range<usize>> - Simple vector of ranges
  • ScratchVec<Range<usize>> - Reusable scratch memory
  • ScratchVec<SpanU32> - Compact u32-based spans (for large buffers)

Benefit: Allows callers to reuse allocations across multiple scans.
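A minimal sketch of the Vec implementation under the trait shape shown above (the method bodies here are assumptions, not the crate's actual code):

```rust
use std::ops::Range;

// Trait shape as described above.
trait SpanSink {
    fn clear(&mut self);
    fn len(&self) -> usize;
    fn push(&mut self, span: Range<usize>);
}

// Vec-backed sink: clear() resets length but keeps capacity, so one
// allocation can be reused across many scans.
impl SpanSink for Vec<Range<usize>> {
    fn clear(&mut self) {
        Vec::clear(self);
    }
    fn len(&self) -> usize {
        Vec::len(self)
    }
    fn push(&mut self, span: Range<usize>) {
        Vec::push(self, span);
    }
}
```

Calling clear() between scans rather than constructing a fresh sink is what amortizes the allocation cost the Benefit note refers to.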

Dispatch Layer

The dispatch functions route to appropriate implementations based on TransformConfig::id:

pub(super) fn find_spans_into(tc: &TransformConfig, buf: &[u8], out: &mut impl SpanSink) {
    match tc.id {
        TransformId::UrlPercent => find_url_spans_into(...),
        TransformId::Base64 => find_base64_spans_into(...),
    }
}

pub(super) fn stream_decode(
    tc: &TransformConfig,
    input: &[u8],
    on_bytes: impl FnMut(&[u8]) -> ControlFlow<()>,
) -> Result<(), ()> {
    match tc.id {
        TransformId::UrlPercent => {
            stream_decode_url_percent(input, tc.plus_to_space, on_bytes);
            Ok(())
        }
        TransformId::Base64 => stream_decode_base64(input, on_bytes).map_err(|_| ()),
    }
}

Gate Policy: AnchorsInDecoded

transform.rs provides the streaming decode primitive, but gate enforcement is implemented in crates/scanner-engine/src/engine/core.rs and crates/scanner-engine/src/engine/stream_decode.rs.

For Gate::AnchorsInDecoded:

  • Base64 may first use Engine::base64_buffer_gate() on encoded bytes.
  • URL zero-hit paths may use url_percent_buffer_gate() on encoded bytes.
  • Decoded chunks are streamed through the decoded gate (vs_gate) and tracked across chunks.
  • If gate DB setup/scan fails, enforcement can be relaxed to avoid false negatives.

// Simplified engine behavior
let mut gate_hit = false;
stream_decode(tc, encoded_span, |decoded_chunk| {
    gate_hit |= gate_scan(decoded_chunk);  // chunk is scanned together with a carry-over tail
    ControlFlow::Continue(())
})?;
if gate_is_enforced && !gate_hit {
    discard_decoded_span();
}

Note: The transform module itself does not decide gate outcomes; it only emits decoded chunks.

Key Functions and Data Structures

Byte Classification

const URLISH: u8 = 1 << 0;        // URL-ish run character
const B64_CHAR: u8 = 1 << 1;      // Base64 alphabet byte
const B64_WS: u8 = 1 << 2;        // Base64 whitespace (\t, \r, \n)
const B64_WS_SPACE: u8 = 1 << 3;  // Space character (if allowed)

static BYTE_CLASS: [u8; 256] = build_byte_class();  // Lookup table

// Lookups are O(1) per byte
let flags = BYTE_CLASS[b as usize];
let is_urlish = (flags & URLISH) != 0;
let is_b64 = (flags & B64_CHAR) != 0;

Base64 Decode Table

const B64_INVALID: u8 = 0xFF;        // Invalid byte marker
const B64_PAD: u8 = 64;              // Padding ('=') marker

static B64_DECODE: [u8; 256] = build_b64_decode_table();

// Per-byte decode: 0-63 for valid chars, B64_PAD for '=', B64_INVALID for invalid
let v = B64_DECODE[b as usize];
if v == B64_INVALID {
    return Err(Base64DecodeError::InvalidByte);
}

Core Struct: UrlSpanStream

Stateful scanner for URL spans over chunked input:

pub(super) struct UrlSpanStream {
    min_len: usize,
    max_len: usize,
    plus_to_space: bool,
    in_run: bool,
    start: u64,          // Absolute offset
    run_len: usize,
    triggers: usize,     // Count of '%' (and '+' if enabled)
    done: bool,
}

impl UrlSpanStream {
    pub(super) fn new(tc: &TransformConfig) -> Self { ... }
    pub(super) fn feed<F>(&mut self, chunk: &[u8], base_offset: u64, on_span: F)
    where F: FnMut(u64, u64) -> bool { ... }
    pub(super) fn finish<F>(&mut self, end_offset: u64, on_span: F)
    where F: FnMut(u64, u64) -> bool { ... }
}

Core Struct: Base64SpanStream

Stateful scanner for Base64 spans over chunked input:

pub(super) struct Base64SpanStream {
    min_chars: usize,
    max_len: usize,
    allow_space_ws: bool,
    in_run: bool,
    pad_seen: bool,
    start: u64,          // Absolute offset
    run_len: usize,
    b64_chars: usize,    // Non-padding alphabet bytes only
    have_b64: bool,
    last_b64: u64,       // Offset of last base64 byte (including '=')
    done: bool,
}

impl Base64SpanStream {
    pub(super) fn new(tc: &TransformConfig) -> Self { ... }
    pub(super) fn feed<F>(&mut self, chunk: &[u8], base_offset: u64, on_span: F)
    where F: FnMut(u64, u64) -> bool { ... }
    pub(super) fn finish<F>(&mut self, _end_offset: u64, on_span: F)
    where F: FnMut(u64, u64) -> bool { ... }
}

Streaming Decoders

fn stream_decode_url_percent(
    input: &[u8],
    plus_to_space: bool,
    on_bytes: impl FnMut(&[u8]) -> ControlFlow<()>,
);

fn stream_decode_base64(
    input: &[u8],
    on_bytes: impl FnMut(&[u8]) -> ControlFlow<()>,
) -> Result<(), Base64DecodeError>;

Internal Dispatch Functions

pub(super) fn transform_quick_trigger(tc: &TransformConfig, buf: &[u8]) -> bool;

pub(super) fn find_spans_into(tc: &TransformConfig, buf: &[u8], out: &mut impl SpanSink);

pub(super) fn stream_decode(
    tc: &TransformConfig,
    input: &[u8],
    on_bytes: impl FnMut(&[u8]) -> ControlFlow<()>,
) -> Result<(), ()>;

pub(super) fn map_decoded_offset(
    tc: &TransformConfig,
    encoded: &[u8],
    decoded_offset: usize,
) -> usize;

Edge Cases and Trade-offs

Span Splitting

Issue: Long runs are split at max_len boundaries, which may cut through encoding quanta.

Input (max_len=4): "%41%42"
Splits:
  Span 1: "%41%"
  Span 2: "42"

Decoder behavior:
  Span 1 -> "A%"   (trailing '%' is incomplete, stays literal)
  Span 2 -> "42"   (no leading '%', stays literal)

Rationale: Single-pass scanning must stay bounded; downstream decode inherits imprecision trade-off.

Mitigation: Decoder validates and handles fragments gracefully.

Whitespace Handling in Base64

Issue: Whitespace is ignored for min_chars, but it still consumes max_len budget.

Input:  "SGVs\nbG8="  (7 non-padding bytes + 1 newline + '=' padding)
Config: min_chars=4, max_len=8

Result:
  b64_chars counts only non-padding alphabet bytes
  run_len counts all allowed bytes (alphabet + whitespace + '=')

  The run can split earlier because whitespace and '=' still advance run_len.

Rationale: min_chars tracks meaningful base64 content while max_len still bounds scan work.

Callback Early Exit

Design: Decoders respect ControlFlow::Break(()) to allow callers to abort decoding early.

// Example: limit output to 1 MB
let mut total_out = 0usize;
stream_decode_url_percent(&input, true, |chunk| {
    total_out += chunk.len();
    if total_out > 1024 * 1024 {
        return ControlFlow::Break(());  // Stop decoding early
    }
    process_chunk(chunk);
    ControlFlow::Continue(())
});

Effect: Output may be partial; caller must track bytes seen before break.

Configuration

Transform behavior is controlled via TransformConfig:

pub struct TransformConfig {
    pub id: TransformId,                    // UrlPercent or Base64
    pub mode: TransformMode,                // Disabled | IfNoFindingsInThisBuffer | Always
    pub gate: Gate,                         // None | AnchorsInDecoded
    pub min_len: usize,                     // Minimum span size
    pub max_spans_per_buffer: usize,        // Stop after N spans
    pub max_encoded_len: usize,             // Maximum span size (splits long runs)
    pub max_decoded_bytes: usize,           // Per-span decoded output cap (enforced by engine callers)
    pub plus_to_space: bool,                // URL: '+' → space
    pub base64_allow_space_ws: bool,        // Base64: allow space in whitespace
}

Performance Characteristics

Operation              Complexity          Notes
Span finding           O(N) single-pass    N = input buffer size
Byte classification    O(1)                Lookup table
URL decoding           O(N)                Single pass, O(1) memory
Base64 decoding        O(N)                Single pass, O(1) memory
Output buffering       O(K)                K = output buffer size (16 KiB)

Memory Usage:

  • Span finder state: fixed-size per stream instance (no heap allocations)
  • Decode buffer: 16 KiB stack allocation per decode call
  • No heap allocations during decode
  • Allocations via SpanSink (caller responsibility)

Testing Helpers

Test-only functions provide convenient interfaces with bounds checking:

#[cfg(test)]
fn decode_url_percent_to_vec(
    input: &[u8],
    plus_to_space: bool,
    max_out: usize,
) -> Result<Vec<u8>, UrlTestError>;

#[cfg(test)]
fn decode_base64_to_vec(
    input: &[u8],
    max_out: usize,
) -> Result<Vec<u8>, Base64TestError>;

Purpose: Simplified testing without streaming callback infrastructure.

Benchmark Helpers

Feature-gated benchmark functions for performance measurement:

#[cfg(feature = "bench")]
pub fn bench_stream_decode_url(input: &[u8], plus_to_space: bool) -> usize;

#[cfg(feature = "bench")]
pub fn bench_stream_decode_base64(input: &[u8]) -> usize;

Returns bytes successfully decoded (to verify computation wasn't optimized away).

Related Modules

  • engine::hit_pool - Span data structures (SpanU32)
  • api::TransformConfig - Configuration schema
  • engine::core / engine::stream_decode - Gate enforcement and streaming transform integration
  • scratch_memory::ScratchVec - Reusable buffer allocations
  • memchr - Fast byte searching (prefilter)

Summary

The transform module provides:

  1. Permissive span finding with configurable budgets and early termination
  2. Streaming decoders with O(1) memory and chunk-based emission
  3. Graceful handling of malformed input (invalid escapes, padding errors)
  4. Callback integration for gate policies and output bounds
  5. High throughput via single-pass algorithms and lookup tables

The design emphasizes scanning throughput (permissive span finding) with correctness enforcement (strict decoding and optional gates).