feat: support interrupts from AfterToolCallEvent (#1165)#2178
feat: support interrupts from AfterToolCallEvent (#1165)#2178alisknox wants to merge 1 commit intostrands-agents:mainfrom
Conversation
Add interrupt support to AfterToolCallEvent, enabling human-in-the-loop workflows after tool execution. This allows hook callbacks to pause agent execution after inspecting a tool result (e.g., on failure) and resume with a human response.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
/strands review |
| ) | ||
| replayed, new_interrupts = await agent.hooks.invoke_callbacks_async(original_event) | ||
| if new_interrupts: | ||
| interrupts.extend(new_interrupts) |
There was a problem hiding this comment.
Issue: Re-interrupt during replay loses the after-tool snapshot.
When the replay loop (L501-503) produces new_interrupts, these get added to the top-level interrupts list, which triggers the interrupt path at L589. However, after_tool_interrupt_events (L564) is only populated from the tool_events loop at L566-577 — snapshots for tools that re-interrupt during the replay loop are never added to after_tool_interrupt_events. This means _interrupt_state.context["after_tool_events"] won't contain the snapshot for the re-interrupted tool, so the next resume won't replay it.
Suggestion: When the replay loop encounters new_interrupts, re-add the snapshot to after_tool_interrupt_events:
if new_interrupts:
interrupts.extend(new_interrupts)
after_tool_interrupt_events.append(snapshot)
continueNote that after_tool_interrupt_events would need to be declared before the if agent._interrupt_state.activated block (move it up from L564) to be accessible in the replay loop.
There was a problem hiding this comment.
Fixed. Moved after_tool_interrupt_events declaration before the if agent._interrupt_state.activated block and added after_tool_interrupt_events.append(snapshot) when replay produces new interrupts. Added a test (test_after_tool_interrupt_re_interrupt_during_replay) that validates the snapshot is re-persisted for the next resume.
There was a problem hiding this comment.
Fixed. Moved after_tool_interrupt_events declaration before the if agent._interrupt_state.activated block and added after_tool_interrupt_events.append(snapshot) when replay produces new interrupts. Added test_after_tool_interrupt_re_interrupt_during_replay to validate.
| for snapshot in after_tool_snapshots: | ||
| tool_name = snapshot["tool_use"]["name"] | ||
| tool_func = agent.tool_registry.dynamic_tools.get(tool_name) or agent.tool_registry.registry.get(tool_name) | ||
| original_exception = Exception(snapshot["exception"]) if snapshot.get("exception") else None |
There was a problem hiding this comment.
Issue: Original exception type is lost during serialization/deserialization.
The exception is stored as str(evt.exception) at L575 and restored as Exception(snapshot["exception"]) here. This means the original exception type (e.g., RuntimeError, ValueError) is replaced with a generic Exception. Callbacks that use isinstance checks on the exception type during replay will see different behavior than the original invocation.
Suggestion: Consider preserving the exception type name and restoring it, or at minimum document this limitation in the docstring. If type fidelity matters, you could store both:
"exception": str(evt.exception) if evt.exception else None,
"exception_type": type(evt.exception).__name__ if evt.exception else None,Though for session persistence across processes, the generic Exception approach may be the pragmatic choice — just make sure the docstring calls out this behavior.
There was a problem hiding this comment.
Documented this limitation in both the AfterToolCallEvent docstring and as an inline comment in the replay loop. The generic Exception approach is the pragmatic choice for session persistence across processes — the docstring now explicitly calls out the behavior.
There was a problem hiding this comment.
Documented in both the AfterToolCallEvent docstring and the replay loop inline comments. Generic Exception is the pragmatic choice for session persistence — the docstring now explicitly calls out the behavior.
| cancel_message=snapshot.get("cancel_message"), | ||
| ) | ||
| replayed, new_interrupts = await agent.hooks.invoke_callbacks_async(original_event) | ||
| if new_interrupts: |
There was a problem hiding this comment.
Issue: Missing test coverage for the re-interrupt-during-replay path.
Codecov reports 80% patch coverage on this file with 2 missing lines and 3 partials. The path at L502-503 (replay produces new interrupts) appears untested. Given the bug identified above (snapshots not re-added for re-interrupts), this would be a valuable test case. A test for the "callback interrupts again during replay" scenario would both validate the fix and cover the gap.
There was a problem hiding this comment.
Added test_after_tool_interrupt_re_interrupt_during_replay which covers this path and validates the bug fix from the first review comment.
There was a problem hiding this comment.
Added test_after_tool_interrupt_re_interrupt_during_replay covering this path and validating the snapshot re-persistence fix.
| interrupts.extend(new_interrupts) | ||
| continue | ||
| tool_use_id = original_event.tool_use["toolUseId"] | ||
| if getattr(replayed, "retry", False): |
There was a problem hiding this comment.
Issue: The _can_write method doesn't include "retry" as a writable property alongside interrupts.
Looking at the interaction between retry and interrupt: when both are set on the same event, interrupt raises InterruptException which short-circuits the callback before retry is typically read. The test test_after_tool_call_interrupt_takes_precedence_over_retry validates this. However, during replay, the callback can set retry=True after getting the interrupt response (without raising). This interplay seems intentional but should be documented more clearly — a comment in the replay logic noting that retry and interrupt responses can coexist on resume would help future readers.
There was a problem hiding this comment.
Added an inline comment in the replay logic explaining that retry and interrupt responses can coexist on resume — the callback receives the human response (interrupt() returns instead of raising) and may then set retry=True.
There was a problem hiding this comment.
Added inline comment in the replay logic explaining that retry and interrupt responses can coexist on resume — interrupt() returns instead of raising when a response is available, then the callback may set retry=True.
| "tool_use": evt.tool_use, | ||
| "result": evt.result, | ||
| "cancel_message": evt.cancel_message, | ||
| "exception": str(evt.exception) if evt.exception else None, |
There was a problem hiding this comment.
Issue: The exception field in the snapshot dictionary is inconsistent with the cancel_message and other snapshot fields.
Looking at the snapshot structure at L570-577, the exception key is always present (set to str(evt.exception) or None). But at L491, the code uses snapshot.get("exception") which suggests it might be absent. These two approaches are consistent in practice since get() with a missing key returns None, but it would be cleaner to check snapshot["exception"] directly (since the key is always set) or use snapshot.get("exception") consistently in both write and read paths.
There was a problem hiding this comment.
Switched to direct key access (snapshot["exception"], snapshot["cancel_message"]) since the keys are always set in the serialization path. Updated test fixtures to include the "exception" key to match.
There was a problem hiding this comment.
Switched to direct key access (snapshot["exception"], snapshot["cancel_message"]) since the keys are always set at serialization (L574-575). Updated test fixtures to include "exception" key to match.
|
Assessment: Request Changes This PR cleanly extends the interrupt system to Review Categories
The overall approach is solid — the |
|
/strands please propose clear and detailed fixes for each area you have called out. You must ensure that this new pattern follows best practices and conventions established in the SDK. |
Review Fixes SummaryAddressed all 5 review findings in commit 1. Critical Bug — Re-interrupt during replay loses snapshotMoved 2. Exception type loss — documentedAdded docstring and inline comments documenting that exception type fidelity is not preserved across resume boundaries (serialized as string, restored as generic 3. Missing test coverage — addedNew test 4. Retry+interrupt interplay — documentedAdded inline comment explaining that on replay, 5. Inconsistent field access — fixedChanged Tests: 856+ tests pass, 0 failures. Ruff, mypy clean. |
|
/strands great. please take one more review in case you may have missed any edge cases, micro-nits or other issues. additionally, once your review is complete, please present your findings clearly and include proposed code diffs. |
Comprehensive Review — Findings & Proposed FixesI did a line-by-line review of every file in this PR. Here are all findings, categorized by severity, with exact proposed diffs. Finding 1 — BUG: Re-interrupt during replay loses the after-tool snapshotFile: Problem: When the replay loop at L501 produces Impact: Multi-round interrupt flows from after-tool hooks silently lose context after the first re-interrupt. Fix: Move - interrupts = []
+ interrupts: list[Interrupt] = []
+ after_tool_interrupt_events: list[dict[str, Any]] = []
if agent._interrupt_state.activated:
...
replayed, new_interrupts = await agent.hooks.invoke_callbacks_async(original_event)
if new_interrupts:
interrupts.extend(new_interrupts)
+ # Re-persist snapshot so the next resume can replay this hook again
+ after_tool_interrupt_events.append(snapshot)
continue
...
tool_events = agent.tool_executor._execute(...)
- after_tool_interrupt_events: list[dict[str, Any]] = []
async for tool_event in tool_events:This also requires adding Finding 2 — BUG: Inconsistent snapshot field access will mask KeyErrorsFile: Problem: The read path uses Fix: Use direct access consistently since keys are always present, and fix test fixtures: # event_loop.py
- original_exception = Exception(snapshot["exception"]) if snapshot.get("exception") else None
+ original_exception = Exception(snapshot["exception"]) if snapshot["exception"] else None
...
- cancel_message=snapshot.get("cancel_message"),
+ cancel_message=snapshot["cancel_message"],
# test_after_tool_call_interrupt.py (two locations)
- "after_tool_events": [{"tool_use": tool_use, "result": original_result, "cancel_message": None}],
+ "after_tool_events": [
+ {"tool_use": tool_use, "result": original_result, "cancel_message": None, "exception": None}
+ ],Finding 3 — COVERAGE: Missing test for re-interrupt-during-replay pathFile: Problem: Codecov reports this path uncovered. It's also where the bug from Finding 1 lives. Fix: Add @pytest.mark.asyncio
async def test_after_tool_interrupt_re_interrupt_during_replay(agent, model, tool, agenerator, alist):
"""When replay fires the after-hook and it interrupts again, the snapshot is re-persisted."""
tool_use = {"toolUseId": "t1", "name": "tool_for_testing", "input": {"random_string": "hello"}}
tool_use_message = {"role": "assistant", "content": [{"toolUse": tool_use}]}
original_result = {"toolUseId": "t1", "status": "success", "content": [{"text": "hello"}]}
original_snapshot = {
"tool_use": tool_use, "result": original_result,
"cancel_message": None, "exception": None,
}
first_interrupt = Interrupt(
id="v1:after_tool_call:t1:d7758bb6-9bb1-5e1c-8919-341e240b4827",
name="step1", reason="first check", response="CONTINUE",
)
agent._interrupt_state.context = {
"tool_use_message": tool_use_message,
"tool_results": [original_result],
"after_tool_events": [original_snapshot],
}
agent._interrupt_state.interrupts[first_interrupt.id] = first_interrupt
agent._interrupt_state.activate()
def re_interrupt_on_replay(event):
if isinstance(event, AfterToolCallEvent):
event.interrupt("step1", reason="first check") # returns "CONTINUE"
event.interrupt("step2", reason="second check") # raises InterruptException
agent.hooks.add_callback(AfterToolCallEvent, re_interrupt_on_replay)
model.stream.return_value = agenerator([{"contentBlockStop": {}}])
stream = strands.event_loop.event_loop.event_loop_cycle(agent, invocation_state={})
events = await alist(stream)
assert events[-1]["stop"][0] == "interrupt"
assert events[-1]["stop"][4][0].name == "step2"
# Snapshot re-persisted for next resume
ctx = agent._interrupt_state.context
assert len(ctx["after_tool_events"]) == 1
assert ctx["after_tool_events"][0]["tool_use"]["toolUseId"] == "t1"Note: the interrupt ID Finding 4 — DOCS: Exception type loss not documentedFile: Problem: The Fix: Extend the docstring: Interrupts:
Hook callbacks can call ``event.interrupt(name, reason)`` to pause agent execution
and request human input. The tool result is preserved and the tool will not be
re-executed on resume. See :func:`strands.event_loop.event_loop._handle_tool_execution`
for the replay mechanism.
+ On resume, the after-tool hook is replayed with the same event data (tool_use,
+ result, cancel_message) that was present at the time of interruption. The callback
+ receives the human response via ``event.interrupt()`` return value, and can then
+ modify the result or set ``retry=True`` to re-execute the tool.
+
+ Note: exception type fidelity is not preserved across resume boundaries.
+ The original exception is serialized as a string for session persistence and
+ restored as a generic ``Exception`` on replay. Callbacks that rely on
+ ``isinstance`` checks against specific exception types should account for this.Also add a corresponding inline comment in the replay loop in Finding 5 — DOCS: Retry + interrupt interplay on replay undocumentedFile: Problem: During replay, a callback can receive the interrupt response (since Fix: tool_use_id = original_event.tool_use["toolUseId"]
+ # On replay, retry and interrupt responses can coexist: the callback receives
+ # the human response (interrupt() returns instead of raising) and may then
+ # decide to set retry=True to re-execute the tool with the new context.
if getattr(replayed, "retry", False):Summary
All fixes applied in commit |
Add interrupt support to AfterToolCallEvent.
Description
Extend AfterToolCallEvent with the _Interruptible mixin, allowing after-tool hook callbacks to pause agent execution.
On resume, the after-tool hook is replayed with the same event data (tool_use, result, exception, cancel_message) that was present at the time of interruption. The callback receives the interrupt result via event.interrupt() return value, and can then modify the result or request a retry.
Tools do not re-execute on resume unless the replayed hook explicitly sets retry=True.
Related Issues
#1165 (Solves primary use-case, other events interruptible is still nice to have)
Documentation PR
Type of Change
New feature
Testing
How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli
Within SSDP, executed the interrupt in the after tool call handler. Successfully re-hydrated and continued with the same context instead of needing the event.retry flag to resume from the before-tool hook.
hatch run prepareChecklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.