[FEATURE] No hook event for upstream tool/infra failures (5xx, network errors) #48650

@st4nst4nst4n

Description

Preflight Checklist

  • I have searched existing requests and this feature hasn't been requested yet
  • This is a single feature request (not multiple features)

Problem Statement

Claude Code's hook system exposes events tied to Claude's actions — PreToolUse, PostToolUse, Stop, UserPromptSubmit, etc. — but nothing fires when the harness itself fails: upstream 5xx responses, network errors, tool timeouts, or API rate limits surfaced mid-session.

Today these failures are only visible as error strings inside a tool result that Claude happens to read and relay. That makes them:

  • Invisible to automation. Users can't wire telemetry, paging, auto-retry-with-backoff, or desktop notifications to infra failures, because there's no event to hook.
  • Inconsistently surfaced. Whether a user learns their session hit a 500 depends on whether Claude decides to mention it in the response, which is non-deterministic.
  • Non-actionable for teams. Power users running Claude Code in CI, long-running agents, or /loop-style workflows have no way to distinguish "Claude failed the task" from "the harness failed Claude" without scraping transcripts.

What would solve it:

A hook event (e.g. ToolCallError or HarnessError) that fires on non-2xx responses from the tool infrastructure, network failures, and timeouts — with the tool name, error class, status code, and retry count in the payload. Same shape as existing hooks so it composes with current configs.

Why now:

As Claude Code gets used for longer, more autonomous workflows (scheduled agents, /loop, CI runs), silent infra failures become a correctness and observability problem, not just a UX papercut.

Proposed Solution

A new hook event, potentially named ToolCallError, that fires whenever the harness fails to get a successful response from a tool. It could fire before, or instead of, surfacing the failure to Claude.

Hypothetical payload:

{
  "hook_event_name": "ToolCallError",
  "tool_name": "Bash",
  "tool_input": { ... },
  "error": {
    "kind": "http_5xx" | "network" | "timeout" | "rate_limit" | "other",
    "status_code": 500,
    "message": "Internal Server Error",
    "retry_attempt": 2,
    "will_retry": true
  },
  "session_id": "...",
  "transcript_path": "..."
}
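A hook script consuming this payload might parse it roughly as follows. This is only a sketch in Python: the ToolCallError event and all field names come from the hypothetical payload above and do not exist in Claude Code today, and `ToolCallErrorEvent` and `parse_tool_call_error` are illustrative names.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Error kinds from the hypothetical payload; none of this is a real API.
ERROR_KINDS = {"http_5xx", "network", "timeout", "rate_limit", "other"}


@dataclass
class ToolCallErrorEvent:
    tool_name: str
    kind: str
    status_code: Optional[int]
    message: str
    retry_attempt: int
    will_retry: bool
    session_id: str


def parse_tool_call_error(raw: str) -> ToolCallErrorEvent:
    """Parse the proposed ToolCallError payload (hypothetical shape)."""
    data = json.loads(raw)
    if data.get("hook_event_name") != "ToolCallError":
        raise ValueError("not a ToolCallError event")
    err = data["error"]
    # Unknown kinds collapse to "other" so new kinds don't break old hooks.
    kind = err["kind"] if err["kind"] in ERROR_KINDS else "other"
    return ToolCallErrorEvent(
        tool_name=data["tool_name"],
        kind=kind,
        # Assumption: status_code may be absent for network errors and timeouts.
        status_code=err.get("status_code"),
        message=err.get("message", ""),
        retry_attempt=err.get("retry_attempt", 0),
        will_retry=err.get("will_retry", False),
        session_id=data.get("session_id", ""),
    )
```

Mapping unrecognized kinds to "other" is a design choice in this sketch, so that a hook written today would keep working if more error kinds were added later.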

Matcher semantics: I'd expect these to work the same way as for PreToolUse / PostToolUse: users match on a tool_name pattern, so hooks can be scoped to Bash, WebFetch, MCP tools, etc. independently.
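To make the scoping idea concrete, here is a small Python sketch of how a tool_name matcher could be evaluated. It assumes an empty matcher or "*" matches everything and that anything else is a regex tested against the full tool name. `matcher_applies` is a hypothetical helper, and the real rules would be whatever the harness implements.

```python
import re


def matcher_applies(matcher: str, tool_name: str) -> bool:
    """Return True if a hook's matcher pattern covers the given tool name.

    Assumption: "" and "*" match every tool; any other value is treated
    as a regular expression that must match the whole tool name.
    """
    if matcher in ("", "*"):
        return True
    return re.fullmatch(matcher, tool_name) is not None
```

Under these assumptions, `matcher_applies("Bash|WebFetch", "WebFetch")` is true, while a hook scoped to `"Bash"` would not run for a failure in a tool named `"BashOutput"`.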

Fires: Once per failed attempt, including during retries (so observability tools see the full retry story), with a final event that says whether the harness gave up or a retry eventually succeeded.
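An observability tool could collapse the per-attempt events for one tool call into a single summary. The Python sketch below assumes events arrive in order and that will_retry is false only on the failure after which the harness gives up. Both are assumptions layered on the hypothetical payload, and `summarize_attempts` is an illustrative name.

```python
from typing import Any, Dict, List


def summarize_attempts(events: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Collapse the failure events for one tool call into a single summary.

    Assumptions (not part of any real API): events are in order, and the
    harness sets will_retry to False only on the failure after which it
    gives up. If the last failure still says will_retry=True, the call is
    treated as recovered (or still retrying).
    """
    if not events:
        return {"failed_attempts": 0, "kinds": [], "gave_up": False}
    last_error = events[-1]["error"]
    return {
        "failed_attempts": len(events),
        "kinds": sorted({e["error"]["kind"] for e in events}),
        "gave_up": not last_error["will_retry"],
    }
```

A pager integration might alert only when `gave_up` is true, while a dashboard could still count every failed attempt.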

Example configs this unlocks:

  • osascript -e 'display notification "Claude hit a 500"' for a desktop ping on infra failure
  • Append to a local JSONL file for session-level error telemetry
  • Post to Slack/Discord on repeated failures in a long-running /loop or scheduled agent
  • Play a song such as Killing in the Name at the 4m12s timestamp. (The original motivating use case. Completely unserious, but a valid signal that the event surface is missing.)
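The JSONL telemetry idea could look something like the Python sketch below. It assumes the proposed event would deliver its JSON payload on stdin, as existing Claude Code hooks do. The log path, the record fields, and the `append_event` helper are all illustrative choices, not anything Claude Code defines.

```python
import json
import time
from pathlib import Path

# Hypothetical log location; pick whatever suits your setup.
LOG_PATH = Path.home() / ".claude" / "tool_call_errors.jsonl"


def append_event(raw: str, log_path: Path) -> dict:
    """Append one compact record per hook invocation and return it."""
    event = json.loads(raw)
    error = event.get("error", {})
    record = {
        "ts": time.time(),
        "session_id": event.get("session_id"),
        "tool_name": event.get("tool_name"),
        "kind": error.get("kind"),
        "status_code": error.get("status_code"),
        "retry_attempt": error.get("retry_attempt"),
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


# As a hook command, the script would end with something like:
#     import sys
#     append_event(sys.stdin.read(), LOG_PATH)
```

One line per failed attempt keeps the file easy to tail or load into a notebook, and the same function could feed a Slack or Discord webhook once a failure threshold is crossed.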

Alternative Solutions

I looked at PostToolUse, but it's unclear whether it fires consistently on failed tool calls. Even if it does, you'd likely have to string-match error text in the tool result, which is brittle if the wording ever changes.

Priority

Medium - Would be very helpful

Feature Category

Configuration and settings

Use Case Example

This is mostly a joke, but I would not hesitate to set this hook up either to change the behavior around an error or, as mentioned above, to play Killing in the Name at the 4m12s timestamp, so that Claude is telling me very clearly "No, it won't do what I tell it." It would give me a good chuckle.

A more serious example: using Claude for incident triage, where it pulls new incidents, correlates them with deploys, checks dashboards, and posts a structured summary to an incident channel in Slack before on-call is notified. A silent failure anywhere in that chain could result in broken or missing triage context, or no message at all.

The hook for ToolCallError could reasonably unlock observability when Claude is running more autonomously, instead of producing silent failures that could lead to missing an SLA.

Additional Context

No response

Labels

enhancement (New feature or request)
