Preflight Checklist
Problem Statement
Claude Code's hook system exposes events tied to Claude's actions — PreToolUse, PostToolUse, Stop, UserPromptSubmit, etc. — but nothing fires when the harness itself fails: upstream 5xx responses, network errors, tool timeouts, or API rate limits surfaced mid-session.
Today these failures are only visible as error strings inside a tool result that Claude happens to read and relay. That makes them:
Invisible to automation. Users can't wire telemetry, paging, auto-retry-with-backoff, or desktop notifications to infra failures, because there's no event to hook.
Inconsistently surfaced. Whether a user learns their session hit a 500 depends on whether Claude decides to mention it in the response, which is non-deterministic.
Non-actionable for teams. Power users running Claude Code in CI, long-running agents, or /loop-style workflows have no way to distinguish "Claude failed the task" from "the harness failed Claude" without scraping transcripts.
What would solve it:
A hook event (e.g. ToolCallError or HarnessError) that fires on non-2xx responses from the tool infrastructure, network failures, and timeouts, with the tool name, error class, status code, and retry count in the payload. The payload would follow the same shape as existing hooks so it composes with current configs.
Why now:
As Claude Code gets used for longer, more autonomous workflows (scheduled agents, /loop, CI runs), silent infra failures become a correctness and observability problem, not just a UX papercut.
Proposed Solution
A new hook event, tentatively named ToolCallError, that fires whenever the harness fails to get a successful response from a tool. It could fire before, or instead of, surfacing the failure to Claude.
Hypothetical payload:
{
  "hook_event_name": "ToolCallError",
  "tool_name": "Bash",
  "tool_input": { ... },
  "error": {
    "kind": "http_5xx" | "network" | "timeout" | "rate_limit" | "other",
    "status_code": 500,
    "message": "Internal Server Error",
    "retry_attempt": 2,
    "will_retry": true
  },
  "session_id": "...",
  "transcript_path": "..."
}
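To make the payload a bit more concrete, here is a rough sketch of how a hook script might parse it. This assumes the event is delivered as a single JSON document and uses only the field names from the hypothetical payload above. The `ToolCallError` dataclass and `parse_tool_call_error` function are names I made up for illustration and are not part of any existing API.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Error kinds copied from the hypothetical payload; anything else maps to "other".
KNOWN_KINDS = {"http_5xx", "network", "timeout", "rate_limit", "other"}


@dataclass
class ToolCallError:
    """Typed view of the hypothetical ToolCallError payload."""
    tool_name: str
    kind: str
    status_code: Optional[int]
    message: str
    retry_attempt: int
    will_retry: bool
    session_id: str


def parse_tool_call_error(raw: str) -> ToolCallError:
    """Parse one payload from a JSON string, rejecting other hook events."""
    data = json.loads(raw)
    if data.get("hook_event_name") != "ToolCallError":
        raise ValueError("not a ToolCallError payload")
    err = data.get("error", {})
    kind = err.get("kind", "other")
    return ToolCallError(
        tool_name=data.get("tool_name", ""),
        kind=kind if kind in KNOWN_KINDS else "other",
        status_code=err.get("status_code"),  # absent for non-HTTP failures
        message=err.get("message", ""),
        retry_attempt=int(err.get("retry_attempt", 0)),
        will_retry=bool(err.get("will_retry", False)),
        session_id=data.get("session_id", ""),
    )


# Sample payload mirroring the example in this issue (values are illustrative).
sample = json.dumps({
    "hook_event_name": "ToolCallError",
    "tool_name": "Bash",
    "tool_input": {},
    "error": {
        "kind": "http_5xx",
        "status_code": 500,
        "message": "Internal Server Error",
        "retry_attempt": 2,
        "will_retry": True,
    },
    "session_id": "demo-session",
    "transcript_path": "demo-transcript",
})
event = parse_tool_call_error(sample)
print(event.tool_name, event.kind, event.status_code, event.will_retry)
```

In a real hook script the raw string would presumably come from wherever the harness hands the payload to the hook command, rather than from a hard-coded sample.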
Matcher semantics: I'd expect these to be similar to PreToolUse / PostToolUse. Users can match by tool_name pattern, so hooks can be scoped to Bash, WebFetch, MCP tools, etc. independently.
Fires: Once per failed attempt, including during retries (so observability tools see the full retry story), plus a final event that distinguishes the harness giving up from a retry eventually succeeding.
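The firing behavior I have in mind could be sketched roughly like this. This is not how the harness is actually implemented. It is a toy retry loop that emits one event per failed attempt and one final event. The `final`, `outcome`, and `attempts` fields and the `call_with_retries` helper are hypothetical, and I'm assuming `retry_attempt` is 1-based.

```python
from typing import Any, Callable, Dict, List


def call_with_retries(
    tool: Callable[[], Any],
    max_attempts: int,
    emit: Callable[[Dict[str, Any]], None],
) -> Any:
    """Run tool(), emitting a hypothetical ToolCallError event per failed attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool()
        except Exception as exc:
            will_retry = attempt < max_attempts
            emit({
                "final": False,
                "retry_attempt": attempt,  # assumed 1-based
                "will_retry": will_retry,
                "message": str(exc),
            })
            if not will_retry:
                # Final event: the harness gave up.
                emit({"final": True, "outcome": "gave_up", "attempts": attempt})
                raise
        else:
            if attempt > 1:
                # Final event: a retry succeeded after earlier failures.
                emit({"final": True, "outcome": "recovered", "attempts": attempt})
            return result


# Demo: a tool that times out twice and then succeeds.
events: List[Dict[str, Any]] = []
calls = {"n": 0}


def flaky_tool() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("tool timed out")
    return "ok"


result = call_with_retries(flaky_tool, max_attempts=3, emit=events.append)
for ev in events:
    print(ev)
```

With this shape, a telemetry hook sees two failed-attempt events followed by one final "recovered" event, while a paging hook could react only to a final "gave_up" event.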
Example configs this unlocks:
osascript -e 'display notification "Claude hit a 500"' — desktop ping on infra failure
Append to a local JSONL for session-level error telemetry
Post to Slack/Discord on repeated failures in a long-running /loop or scheduled agent
Play a song such as Killing in the Name at the 4m12s timestamp. (The original motivating use case. Completely unserious, but a valid signal that the event surface is missing.)
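As a rough illustration of the JSONL telemetry and "repeated failures" ideas in the list above, a hook script could look something like the following. It assumes the hypothetical payload shape from this issue. The helper names, the log file name, and the threshold and window values are all made up, and the actual Slack/Discord post is left out so the sketch stays self-contained.

```python
import json
import tempfile
import time
from pathlib import Path


def append_error(log_path: Path, payload: dict, now: float) -> None:
    """Append one compact record per hypothetical ToolCallError event."""
    error = payload.get("error", {})
    record = {
        "ts": now,
        "session_id": payload.get("session_id"),
        "tool_name": payload.get("tool_name"),
        "kind": error.get("kind"),
        "status_code": error.get("status_code"),
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def recent_failures(log_path: Path, session_id: str, now: float, window_s: float) -> int:
    """Count this session's logged failures within the last window_s seconds."""
    if not log_path.exists():
        return 0
    count = 0
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            if record["session_id"] == session_id and now - record["ts"] <= window_s:
                count += 1
    return count


def should_alert(log_path: Path, session_id: str, now: float,
                 threshold: int = 3, window_s: float = 300.0) -> bool:
    """Decide whether repeated failures warrant a chat or pager notification."""
    return recent_failures(log_path, session_id, now, window_s) >= threshold


# Demo: three failures a second apart in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "tool_errors.jsonl"
    payload = {
        "hook_event_name": "ToolCallError",
        "tool_name": "WebFetch",
        "error": {"kind": "http_5xx", "status_code": 500},
        "session_id": "demo-session",
    }
    start = time.time()
    for i in range(3):
        append_error(log, payload, now=start + i)
    alert = should_alert(log, "demo-session", now=start + 3)
    print("alert:", alert)
```

Keeping the alert decision separate from the logging means the same log can feed both quiet telemetry and a louder notification path.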
Alternative Solutions
I looked at PostToolUse, but it's unclear whether it fires consistently on failed tool calls. You'd also likely have to string-match on error text, which is brittle if the wording ever changes.
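To show why I consider string matching brittle, here is a small comparison. The regex patterns and sample strings are invented for illustration, since the real error text is not something I can rely on. The structured check assumes the hypothetical `error.kind` field proposed above.

```python
import re

# Hypothetical phrasings a PostToolUse hook might grep for in tool output.
INFRA_PATTERNS = [
    re.compile(r"\b5\d\d\b"),
    re.compile(r"internal server error", re.IGNORECASE),
    re.compile(r"timed? ?out", re.IGNORECASE),
]

# Kinds from the proposed payload that indicate the harness, not the task, failed.
INFRA_KINDS = {"http_5xx", "network", "timeout", "rate_limit"}


def looks_like_infra_failure(tool_output: str) -> bool:
    """Brittle approach: guess from free-form text."""
    return any(p.search(tool_output) for p in INFRA_PATTERNS)


def is_infra_failure(error: dict) -> bool:
    """Structured approach: trust the hypothetical error.kind field."""
    return error.get("kind") in INFRA_KINDS


# The text matcher works until the wording shifts or ordinary output collides.
print(looks_like_infra_failure("Internal Server Error"))         # True, a real failure
print(looks_like_infra_failure("upstream service unavailable"))  # False, a missed failure
print(looks_like_infra_failure("processed 512 rows"))            # True, a false positive

# The structured check does not depend on message wording at all.
print(is_infra_failure({"kind": "http_5xx", "message": "upstream service unavailable"}))  # True
```

A dedicated event with a typed `kind` removes both failure modes, which is the main reason I don't think PostToolUse is a good substitute.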
Priority
Medium - Would be very helpful
Feature Category
Configuration and settings
Use Case Example
This is mostly a joke, but I would not hesitate to set this hook up either to change the behavior around an error, as mentioned above, or to play Killing in the Name at the 4m12s timestamp, so that Claude tells me very clearly "No, it won't do what I tell it" and gives me a good chuckle.
A more serious example: using Claude to triage incidents, where it pulls new incidents, correlates them with deploys, checks dashboards, and posts a structured summary to an incident channel in Slack before on-call is notified. Silent failures in that flow could result in broken or missing triage context, or no message at all.
The hook for ToolCallError could reasonably unlock observability when Claude is running more autonomously, instead of producing silent failures that could lead to missing an SLA.
Additional Context
No response