Output tokens per run: ~6.9K avg (tiny — the agent is almost entirely orchestrating bash, not generating text)

**Root cause:** The workflow sequentially clones 8 repos and runs 18 test projects one-by-one, making a separate LLM request after each bash call. This compounds context: each request receives the full system prompt (~40K tokens) plus all accumulated bash output from prior steps. By request 15, the context has grown to ~43K tokens/request on average, totaling 621K input tokens for a workflow that produces only ~7K tokens of useful output.
## Recommendations

### 1. Script-First Execution Pattern

**Estimated savings: ~460K tokens/run (~74%)**
Currently the agent executes 18 test projects interactively — one bash call per step, with each result appended to context. With 13–16 round trips, the rolling context alone accounts for 600K+ input tokens.
**Fix:** Restructure the prompt to have the agent write a single comprehensive bash script, execute it once, then summarize from log files. This collapses ~14 round trips into 3.
Replace the current task-by-task prompt structure with:
## Instructions
Write a single bash script `/tmp/run-all-tests.sh` that:
1. Runs all 8 ecosystem setups and tests
2. Captures all output to `/tmp/test-results/<ecosystem>.log`
3. Exits 0 regardless of individual test failures
Execute the script with `bash /tmp/run-all-tests.sh`, then read each log file
and post the combined summary table as a PR comment.
## Setup

**Bun:**
```bash
curl -fsSL (bun.sh/redacted) | bash
export BUN_INSTALL="$HOME/.bun" PATH="$BUN_INSTALL/bin:$PATH"
gh repo clone Mossaka/gh-aw-firewall-test-bun /tmp/test-bun
```

Test: `cd /tmp/test-bun/elysia && bun install && bun test; cd /tmp/test-bun/hono && bun install && bun test`

... (same pattern for each ecosystem)
**Expected request pattern after change:**
1. Agent writes `/tmp/run-all-tests.sh` (~2K output tokens)
2. Agent runs it (1 bash call, stdout captured in logs)
3. Agent reads log files and posts PR comment
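As a sketch of step 1's output, the script the agent writes might look like this. This is a hypothetical illustration, not the workflow's actual script: `run_suite` is an invented helper, and only the Bun suites from the setup example above are shown.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of /tmp/run-all-tests.sh (Rec 1): run each test
# suite, capture all output to a per-ecosystem log, never abort the batch.
mkdir -p /tmp/test-results

run_suite() {
  local name="$1"; shift
  # Run the suite in a subshell so a failing cd/test cannot kill the
  # script; all stdout/stderr goes to the log, only PASS/FAIL remains.
  if ( "$@" ) >"/tmp/test-results/${name}.log" 2>&1; then
    echo "PASS ${name}"
  else
    echo "FAIL ${name}"
  fi
}

run_suite bun-elysia bash -c 'cd /tmp/test-bun/elysia && bun install && bun test'
run_suite bun-hono   bash -c 'cd /tmp/test-bun/hono   && bun install && bun test'
# ... one run_suite line per remaining project (18 total).
# A real script would end with `exit 0` so individual suite failures
# do not fail the workflow step.
```

The key property is that the agent's context only ever sees one short PASS/FAIL line per suite; the verbose build output stays in the log files.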
**Token projection:**
| Metric | Current | Projected |
|--------|---------|-----------|
| LLM requests/run | ~14 | ~3 |
| Input tokens | ~614K | ~150K |
| Output tokens | ~7K | ~7K |
| **Total tokens** | **~621K** | **~157K** |
The cache hit rate should remain high (~90%) since the stable system prompt portion doesn't change.
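For step 3, assembling the combined summary table can itself be mechanical. A hedged sketch, assuming the per-ecosystem log layout from the script above; `build_summary` is a hypothetical helper, and the fail heuristic (scan the log tail for "fail"/"error") is an assumption:

```shell
# Build a markdown results table from per-ecosystem logs (hypothetical).
# A suite counts as failed if its final lines mention "fail" or "error".
build_summary() {
  local dir="$1"
  echo '| Ecosystem | Result |'
  echo '|-----------|--------|'
  local log name
  for log in "$dir"/*.log; do
    [ -e "$log" ] || continue            # directory may be empty
    name="$(basename "$log" .log)"
    if tail -30 "$log" | grep -qiE 'fail|error'; then
      echo "| ${name} | fail |"
    else
      echo "| ${name} | pass |"
    fi
  done
}
# The agent would then post the table as the PR comment via safeoutputs:
#   build_summary /tmp/test-results > /tmp/summary.md
```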
---
### 2. Restrict GitHub MCP Toolset
**Estimated savings: ~154K tokens/run (~25%) at current request count; ~33K/run after Rec 1**
The `github:` tool in `build-test.md` loads without a `toolsets:` restriction. The default loads ~22 tool schemas into every LLM context. The workflow only needs:
- `create_pull_request_review_comment` or `add_issue_comment` (but these are handled by `safeoutputs`)
- Likely just `list_pull_requests` for PR context
**Fix:** Add `toolsets` restriction in `build-test.md`:
```diff
 tools:
   bash:
     - "*"
   github:
     github-token: "${{ secrets.GH_AW_GITHUB_MCP_SERVER_TOKEN }}"
+    toolsets: [pull_requests]
```
This reduces loaded tools from ~22 → ~4, saving ~11K tokens per request.
At current 14 requests/run: 14 × 11K = 154K tokens/run saved (-25%)
After Rec 1 (3 requests/run): 3 × 11K = 33K tokens/run saved (-21% of projected 157K)
---

### 3. Truncate Bash Output in Context

**Estimated savings: ~40K tokens/run (~7%)** — applies after Rec 1 to reduce log-reading overhead
Even with the script-first approach, when the agent reads log files, verbose build output (cargo compile warnings, Maven download progress, npm install trees) gets added to context. Each ecosystem's full log can be 3–5K tokens.
**Fix:** Add an output-truncation instruction to the prompt:
> When reading log files, use `tail -30` to capture only the final lines
> (test results and exit codes). If a test fails, re-read the full log to
> get error details.
This reduces per-log read from ~4K → ~0.5K tokens × 8 ecosystems = ~28K tokens saved.
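Followed mechanically, the truncation rule could look like the sketch below. The log paths are assumed from Rec 1, and `summarize_logs` is a hypothetical helper, not part of the workflow:

```shell
# Hypothetical log summarizer per Rec 3: show only the last lines of each
# ecosystem log; flag failures so the agent knows to re-read the full file.
summarize_logs() {
  local dir="${1:-/tmp/test-results}"
  local log
  for log in "$dir"/*.log; do
    [ -e "$log" ] || continue            # no logs yet: nothing to do
    echo "== $(basename "$log" .log) =="
    tail -30 "$log"                      # cheap summary: final lines only
    if tail -30 "$log" | grep -qiE 'fail|error'; then
      echo "-- failure detected; re-read full log: $log --"
    fi
  done
}
```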
---

### 4. Compress Prompt Body

**Estimated savings: ~10K tokens/run (~1.6%)**
The current 7,640-char prompt contains redundant elements that inflate the system context:
- 18-row output table template (~1,200 chars) — the agent doesn't need a pre-filled table to follow; describe the format in one line
- 9 `CRITICAL:` annotations (~600 chars) — consolidate into a single error-handling section
- Maven settings.xml block (~400 chars) — move to a pre-agent step (see Rec 5)
- Repetitive "Clone Repository / CRITICAL: If clone fails" pattern for all 8 ecosystems — abstract into a single rule

Estimated compressed size: ~4,000 chars (-48%). This saves ~4K tokens × 14 requests × 7% uncached ≈ ~4K uncached tokens (minor absolute impact, but it improves clarity).

---

### 5. Pre-Agent Step: Maven Settings

**Estimated savings: ~400 chars of prompt; removes agent error surface**

The Maven proxy settings are deterministic. Move them to a `steps:` block, then remove the entire "Configure Maven Proxy" section from the agent prompt. This eliminates a common agent error source (the agent sometimes forgets the XML or formats it incorrectly).
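The body of such a pre-agent step might look like the following sketch. The proxy host and port below are placeholders, not values taken from the workflow:

```shell
# Create ~/.m2/settings.xml deterministically before the agent starts.
# PROXY_HOST and the port are placeholders for the real firewall proxy.
mkdir -p "$HOME/.m2"
cat > "$HOME/.m2/settings.xml" <<'XML'
<settings>
  <proxies>
    <proxy>
      <id>awf</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>PROXY_HOST</host>
      <port>3128</port>
    </proxy>
  </proxies>
</settings>
XML
```

Because the step runs before the agent, the file is guaranteed to exist with valid XML regardless of what the model does.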
---

## Expected Impact

| Metric | Current | Projected | Savings |
|--------|---------|-----------|---------|
| Total tokens/run | ~621K | ~120K | -81% |
| Effective tokens/run | ~697K | ~140K | -80% |
| LLM requests/run | ~14 | ~3–4 | -75% |
| Uncached input/run | ~53K | ~15K | -72% |
| Session duration | ~107s | ~35s (est.) | -67% |
| Cache hit rate | 91% | 90% | ≈same |
Projections assume Rec 1 (script-first) + Rec 2 (toolset restriction) are both applied.
Individual contribution: Rec 1 = -74%; Rec 2 = -21% on top (of reduced baseline); Rec 3 = -7% on top.
---

## Target Workflow

- Workflow: `build-test.md`
- Source report: #1981
- Estimated cost per run: N/A (api-proxy `estimated_cost` not populated; token counts from AWF firewall logs)
- Total tokens per run: ~621K (avg of 5 complete runs; report avg 443K including 2 cancelled)
- Effective tokens per run: ~697K
- Cache hit rate: 91% (cache_read / total_input)
- LLM requests per run: 13.8 avg (range: 11–16)
- Model: `claude-sonnet-4.6`

## Current Configuration

- Tools: `bash: ["*"]`; `github:` (no toolset restriction — loads ~22 tools)
- Safe outputs: GitHub PR comment/label via `safeoutputs`
- Network allowlist: `defaults`, `github`, `node`, `go`, `rust`, `crates.io`, `java`, `dotnet`, `bun.sh`, `deno.land`, `jsr.io`, `dl.deno.land` (12 groups)
## Implementation Checklist

- [ ] Restructure the `build-test.md` prompt to the script-first pattern (write → execute → summarize)
- [ ] Add `toolsets: [pull_requests]` under `github:` in `build-test.md`
- [ ] Add the `tail -30` instruction for log reading in the prompt
- [ ] Consolidate `CRITICAL:` blocks into one error-handling section
- [ ] Add a `steps:` pre-agent step to create `~/.m2/settings.xml`; remove it from the prompt
- [ ] Recompile: `gh aw compile .github/workflows/build-test.md`
- [ ] Run `npx tsx scripts/ci/postprocess-smoke-workflows.ts`

## Data Sources

- `/tmp/gh-aw/token-audit/copilot-logs.json` (7 runs analyzed; 5 with full token data)
- `.github/workflows/build-test.md` (8,592 bytes)