[FEATURE] No hook event for upstream tool/infra failures (5xx, network errors)

### Preflight Checklist

- [x] I have searched [existing requests](https://github.com/anthropics/claude-code/issues?q=is%3Aissue%20label%3Aenhancement) and this feature hasn't been requested yet
- [x] This is a single feature request (not multiple features)

### Problem Statement

Claude Code's hook system exposes events tied to Claude's actions — PreToolUse, PostToolUse, Stop, UserPromptSubmit, etc. — but nothing fires when the harness itself fails: upstream 5xx responses, network errors, tool timeouts, or API rate limits surfaced mid-session.

Today these failures are only visible as error strings inside a tool result that Claude happens to read and relay. That makes them:

- Invisible to automation. Users can't wire telemetry, paging, auto-retry-with-backoff, or desktop notifications to infra failures, because there's no event to hook.
- Inconsistently surfaced. Whether a user learns their session hit a 500 depends on whether Claude decides to mention it in the response, which is non-deterministic.
- Non-actionable for teams. Power users running Claude Code in CI, long-running agents, or /loop-style workflows have no way to distinguish "Claude failed the task" from "the harness failed Claude" without scraping transcripts.

What would solve it:

A hook event (e.g. ToolCallError or HarnessError) that fires on non-2xx responses from the tool infrastructure, network failures, and timeouts — with the tool name, error class, status code, and retry count in the payload. Same shape as existing hooks so it composes with current configs.

Why now:

As Claude Code gets used for longer, more autonomous workflows (scheduled agents, /loop, CI runs), silent infra failures become a correctness and observability problem, not just a UX papercut.

### Proposed Solution

A new hook event, potentially `ToolCallError`, that fires whenever the harness fails to get a successful response from a tool. It could fire before the failure is surfaced to Claude, or instead of surfacing it.

Hypothetical payload:

```
{
  "hook_event_name": "ToolCallError",
  "tool_name": "Bash",
  "tool_input": { ... },
  "error": {
    "kind": "http_5xx" | "network" | "timeout" | "rate_limit" | "other",
    "status_code": 500,
    "message": "Internal Server Error",
    "retry_attempt": 2,
    "will_retry": true
  },
  "session_id": "...",
  "transcript_path": "..."
}
```
Matcher semantics: I'd expect these to work the same way as for PreToolUse / PostToolUse: users match by tool_name pattern, so hooks can be scoped to Bash, WebFetch, MCP tools, etc. independently.
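
To make the matcher idea concrete, here is a minimal Python sketch of how a `tool_name` pattern might select hooks for a `ToolCallError` payload. It assumes an empty matcher or `*` selects every tool and that anything else is a regex that must match the whole tool name; the hook entries and command names are hypothetical, and the real harness may resolve matchers differently.

```python
import json
import re


def matcher_applies(matcher: str, tool_name: str) -> bool:
    """Return True if a hook matcher would select this tool name.

    Assumption: "" or "*" selects every tool; any other matcher is a
    regex that must match the entire tool name.
    """
    if matcher in ("", "*"):
        return True
    return re.fullmatch(matcher, tool_name) is not None


def select_hooks(hooks: list[dict], payload: dict) -> list[str]:
    """Return the commands whose matcher selects the payload's tool_name."""
    tool_name = payload.get("tool_name", "")
    return [
        hook["command"]
        for hook in hooks
        if matcher_applies(hook.get("matcher", ""), tool_name)
    ]


# Hypothetical hook entries, flattened for illustration only.
HOOKS = [
    {"matcher": "Bash", "command": "notify-bash-failure"},
    {"matcher": "WebFetch|WebSearch", "command": "log-web-failure"},
    {"matcher": "mcp__.*", "command": "page-on-mcp-failure"},
    {"matcher": "*", "command": "append-to-jsonl"},
]

payload = json.loads('{"hook_event_name": "ToolCallError", "tool_name": "Bash"}')
print(select_hooks(HOOKS, payload))
```

With this scoping, a Bash failure would trigger only the Bash-specific hook and the catch-all, while MCP and web hooks stay silent.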

Fires: Once per failed attempt, including during retries (so observability tools see the full retry story), plus a final event that distinguishes the harness giving up from a retry that eventually succeeded.
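
As a rough illustration of what per-attempt events would give an observability tool, the Python sketch below collapses a stream of hypothetical `ToolCallError` payloads into one outcome per tool. It assumes events arrive in order and that `will_retry: false` marks the attempt where the harness gave up; neither is confirmed behavior, and `summarize_retries` and `make_event` are illustrative names.

```python
from collections import defaultdict


def summarize_retries(events: list[dict]) -> dict[tuple[str, str], dict]:
    """Collapse per-attempt error events into one outcome per (session, tool).

    Assumptions: events arrive in order, and `error.will_retry == False`
    marks the attempt where the harness gave up. The hypothetical payload
    has no per-call id, so calls to the same tool in one session are merged.
    """
    summary: dict[tuple[str, str], dict] = defaultdict(
        lambda: {"failed_attempts": 0, "gave_up": False, "kinds": []}
    )
    for event in events:
        error = event["error"]
        entry = summary[(event["session_id"], event["tool_name"])]
        entry["failed_attempts"] += 1
        entry["kinds"].append(error["kind"])
        # The most recent event decides whether the harness has given up.
        entry["gave_up"] = not error["will_retry"]
    return dict(summary)


def make_event(tool: str, kind: str, attempt: int, will_retry: bool) -> dict:
    """Build a sample payload in the hypothetical ToolCallError shape."""
    return {
        "hook_event_name": "ToolCallError",
        "session_id": "s1",
        "tool_name": tool,
        "error": {"kind": kind, "retry_attempt": attempt, "will_retry": will_retry},
    }


events = [
    make_event("WebFetch", "http_5xx", 1, True),
    make_event("WebFetch", "http_5xx", 2, True),
    make_event("WebFetch", "timeout", 3, False),
    make_event("Bash", "network", 1, True),
]
for key, outcome in summarize_retries(events).items():
    print(key, outcome)
```

A consumer like this can only tell "gave up" apart from "still retrying"; telling it apart from "eventually succeeded" is what the final event would add.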

Example configs this unlocks:

- Desktop ping on infra failure: `osascript -e 'display notification "Claude hit a 500"'`
- Append to a local JSONL for session-level error telemetry
- Post to Slack/Discord on repeated failures in a long-running /loop or scheduled agent
- Play a song such as Killing in the Name at the 4m12s timestamp. (The original motivating use case. Completely unserious, but a valid signal that the event surface is missing.)
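
The JSONL telemetry idea from the list above could look roughly like the following Python sketch. It assumes the event would deliver its payload as JSON on the hook command's stdin, in the hypothetical shape from the Proposed Solution; the log path and the function names are made up for illustration.

```python
import json
import time
from pathlib import Path
from typing import TextIO

# Hypothetical log location; any writable path would do.
LOG_PATH = Path.home() / ".claude" / "tool_call_errors.jsonl"


def to_record(payload: dict, now: float) -> dict:
    """Flatten a hypothetical ToolCallError payload into one telemetry row."""
    error = payload.get("error", {})
    return {
        "ts": now,
        "session_id": payload.get("session_id"),
        "tool_name": payload.get("tool_name"),
        "kind": error.get("kind", "other"),
        "status_code": error.get("status_code"),
        "retry_attempt": error.get("retry_attempt"),
        "will_retry": error.get("will_retry"),
    }


def append_record(record: dict, path: Path = LOG_PATH) -> None:
    """Append one JSON object per line, creating the directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def handle(stream: TextIO, path: Path = LOG_PATH) -> dict:
    """Read one payload from the stream, log it, and return the row written."""
    record = to_record(json.load(stream), time.time())
    append_record(record, path)
    return record
```

A hook command would then call `handle(sys.stdin)` and exit. One line per failed attempt keeps the file easy to filter by `session_id` or `kind` later.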

### Alternative Solutions

I looked at PostToolUse, but it's unclear whether it fires consistently on failed tool calls. Even if it does, you'd likely have to string-match on the tool output, which is brittle if the error text changes.
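
To show why the string-matching workaround is fragile, here is a small Python sketch of the kind of heuristic a PostToolUse hook would need. The patterns and sample strings are hypothetical; real tool output is not guaranteed to look like any of them.

```python
import re

# Hypothetical error fragments a hook might grep for in tool output.
INFRA_PATTERNS = [
    re.compile(r"\b5\d\d\b"),                          # bare 5xx status code
    re.compile(r"internal server error", re.IGNORECASE),
    re.compile(r"timed? ?out", re.IGNORECASE),         # timeout / timed out
    re.compile(r"rate limit", re.IGNORECASE),
]


def looks_like_infra_failure(tool_output: str) -> bool:
    """Guess from free text whether a tool call failed for infra reasons."""
    return any(pattern.search(tool_output) for pattern in INFRA_PATTERNS)


samples = [
    "HTTP 500 Internal Server Error",      # intended hit
    "Processed 512 rows successfully",     # false positive: 512 looks like a 5xx
    "connection reset by peer",            # false negative: no pattern covers it
]
for text in samples:
    print(looks_like_infra_failure(text), text)
```

The second and third samples are misclassified, which is the brittleness a structured `error.kind` field would avoid.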

### Priority

Medium - Would be very helpful

### Feature Category

Configuration and settings

### Use Case Example

This is mostly a joke, but I would not hesitate to set said hook up either to change the behavior around an error or, as mentioned above, to play Killing in the Name at the 4m12s timestamp, so that Claude tells me very clearly "No, it won't do what I tell it" and gives me a good chuckle.

A more serious example: when leveraging Claude for incident triage (pulling new incidents, correlating them with deploys, checking dashboards, and posting a structured summary to an incident channel in Slack before on-call is notified), silent failures could result in broken or missing triage context, or no message at all.

The hook for `ToolCallError` could reasonably unlock observability when Claude is running more autonomously, instead of producing silent failures that could lead to missing an SLA.

### Additional Context

_No response_
