Skip to content

feat: support interrupts from AfterToolCallEvent (#1165)#2178

Open
alisknox wants to merge 1 commit intostrands-agents:mainfrom
alisknox:feat/after-tool-call-interrupt
Open

feat: support interrupts from AfterToolCallEvent (#1165)#2178
alisknox wants to merge 1 commit intostrands-agents:mainfrom
alisknox:feat/after-tool-call-interrupt

Conversation

@alisknox
Copy link
Copy Markdown

@alisknox alisknox commented Apr 21, 2026

Add interrupt support to AfterToolCallEvent.

Description

Extend AfterToolCallEvent with the _Interruptible mixin, allowing after-tool hook callbacks to pause agent execution.

On resume, the after-tool hook is replayed with the same event data (tool_use, result, exception, cancel_message) that was present at the time of interruption. The callback receives the interrupt result via event.interrupt() return value, and can then modify the result or request a retry.

Tools do not re-execute on resume unless the replayed hook explicitly sets retry=True.

Related Issues

#1165 (Solves primary use-case, other events interruptible is still nice to have)

Documentation PR

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

Within SSDP, executed the interrupt in the after tool call handler. Successfully re-hydrated and continued with the same context instead of needing the event.retry flag to resume from the before-tool hook.

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Add interrupt support to AfterToolCallEvent, enabling human-in-the-loop
workflows after tool execution. This allows hook callbacks to pause agent
execution after inspecting a tool result (e.g., on failure) and resume
with a human response.
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 90.74074% with 5 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/strands/event_loop/event_loop.py 80.00% 2 Missing and 3 partials ⚠️

📢 Thoughts on this report? Let us know!

@gautamsirdeshmukh
Copy link
Copy Markdown
Contributor

/strands review

)
replayed, new_interrupts = await agent.hooks.invoke_callbacks_async(original_event)
if new_interrupts:
interrupts.extend(new_interrupts)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: Re-interrupt during replay loses the after-tool snapshot.

When the replay loop (L501-503) produces new_interrupts, these get added to the top-level interrupts list, which triggers the interrupt path at L589. However, after_tool_interrupt_events (L564) is only populated from the tool_events loop at L566-577 — snapshots for tools that re-interrupt during the replay loop are never added to after_tool_interrupt_events. This means _interrupt_state.context["after_tool_events"] won't contain the snapshot for the re-interrupted tool, so the next resume won't replay it.

Suggestion: When the replay loop encounters new_interrupts, re-add the snapshot to after_tool_interrupt_events:

if new_interrupts:
    interrupts.extend(new_interrupts)
    after_tool_interrupt_events.append(snapshot)
    continue

Note that after_tool_interrupt_events would need to be declared before the if agent._interrupt_state.activated block (move it up from L564) to be accessible in the replay loop.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Moved after_tool_interrupt_events declaration before the if agent._interrupt_state.activated block and added after_tool_interrupt_events.append(snapshot) when replay produces new interrupts. Added a test (test_after_tool_interrupt_re_interrupt_during_replay) that validates the snapshot is re-persisted for the next resume.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Moved after_tool_interrupt_events declaration before the if agent._interrupt_state.activated block and added after_tool_interrupt_events.append(snapshot) when replay produces new interrupts. Added test_after_tool_interrupt_re_interrupt_during_replay to validate.

for snapshot in after_tool_snapshots:
tool_name = snapshot["tool_use"]["name"]
tool_func = agent.tool_registry.dynamic_tools.get(tool_name) or agent.tool_registry.registry.get(tool_name)
original_exception = Exception(snapshot["exception"]) if snapshot.get("exception") else None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: Original exception type is lost during serialization/deserialization.

The exception is stored as str(evt.exception) at L575 and restored as Exception(snapshot["exception"]) here. This means the original exception type (e.g., RuntimeError, ValueError) is replaced with a generic Exception. Callbacks that use isinstance checks on the exception type during replay will see different behavior than the original invocation.

Suggestion: Consider preserving the exception type name and restoring it, or at minimum document this limitation in the docstring. If type fidelity matters, you could store both:

"exception": str(evt.exception) if evt.exception else None,
"exception_type": type(evt.exception).__name__ if evt.exception else None,

Though for session persistence across processes, the generic Exception approach may be the pragmatic choice — just make sure the docstring calls out this behavior.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented this limitation in both the AfterToolCallEvent docstring and as an inline comment in the replay loop. The generic Exception approach is the pragmatic choice for session persistence across processes — the docstring now explicitly calls out the behavior.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented in both the AfterToolCallEvent docstring and the replay loop inline comments. Generic Exception is the pragmatic choice for session persistence — the docstring now explicitly calls out the behavior.

cancel_message=snapshot.get("cancel_message"),
)
replayed, new_interrupts = await agent.hooks.invoke_callbacks_async(original_event)
if new_interrupts:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: Missing test coverage for the re-interrupt-during-replay path.

Codecov reports 80% patch coverage on this file with 2 missing lines and 3 partials. The path at L502-503 (replay produces new interrupts) appears untested. Given the bug identified above (snapshots not re-added for re-interrupts), this would be a valuable test case. A test for the "callback interrupts again during replay" scenario would both validate the fix and cover the gap.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test_after_tool_interrupt_re_interrupt_during_replay which covers this path and validates the bug fix from the first review comment.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test_after_tool_interrupt_re_interrupt_during_replay covering this path and validating the snapshot re-persistence fix.

interrupts.extend(new_interrupts)
continue
tool_use_id = original_event.tool_use["toolUseId"]
if getattr(replayed, "retry", False):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: The _can_write method doesn't include "retry" as a writable property alongside interrupts.

Looking at the interaction between retry and interrupt: when both are set on the same event, interrupt raises InterruptException which short-circuits the callback before retry is typically read. The test test_after_tool_call_interrupt_takes_precedence_over_retry validates this. However, during replay, the callback can set retry=True after getting the interrupt response (without raising). This interplay seems intentional but should be documented more clearly — a comment in the replay logic noting that retry and interrupt responses can coexist on resume would help future readers.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an inline comment in the replay logic explaining that retry and interrupt responses can coexist on resume — the callback receives the human response (interrupt() returns instead of raising) and may then set retry=True.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added inline comment in the replay logic explaining that retry and interrupt responses can coexist on resume — interrupt() returns instead of raising when a response is available, then the callback may set retry=True.

"tool_use": evt.tool_use,
"result": evt.result,
"cancel_message": evt.cancel_message,
"exception": str(evt.exception) if evt.exception else None,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: The exception field in the snapshot dictionary is inconsistent with the cancel_message and other snapshot fields.

Looking at the snapshot structure at L570-577, the exception key is always present (set to str(evt.exception) or None). But at L491, the code uses snapshot.get("exception") which suggests it might be absent. These two approaches are consistent in practice since get() with a missing key returns None, but it would be cleaner to check snapshot["exception"] directly (since the key is always set) or use snapshot.get("exception") consistently in both write and read paths.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to direct key access (snapshot["exception"], snapshot["cancel_message"]) since the keys are always set in the serialization path. Updated test fixtures to include the "exception" key to match.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to direct key access (snapshot["exception"], snapshot["cancel_message"]) since the keys are always set at serialization (L574-575). Updated test fixtures to include "exception" key to match.

@github-actions
Copy link
Copy Markdown

Assessment: Request Changes

This PR cleanly extends the interrupt system to AfterToolCallEvent, following existing patterns from BeforeToolCallEvent. The replay mechanism and source_event propagation are well-designed, and the test coverage across all four executor paths is thorough.

Review Categories
  • Correctness: The re-interrupt-during-replay path has a bug where after-tool snapshots are not re-persisted to after_tool_events, meaning a multi-round interrupt flow from the after-hook would lose context on the second resume.
  • Data Fidelity: Exception type information is lost during snapshot serialization — callbacks see generic Exception on replay instead of the original type. This may be an acceptable trade-off for session persistence but should be documented.
  • Test Coverage: The re-interrupt-during-replay path (L502-503) is untested, which is where the snapshot persistence bug lives. Adding a test here would both cover the gap and validate the fix.
  • Test Organization: Event loop test fixtures duplicate patterns from the executor conftest; extracting to a shared conftest would reduce maintenance burden.

The overall approach is solid — the _Interruptible mixin reuse, snapshot-based replay, and source_event plumbing are clean design choices.

@gautamsirdeshmukh
Copy link
Copy Markdown
Contributor

/strands please propose clear and detailed fixes for each area you have called out. You must ensure that this new pattern follows best practices and conventions established in the SDK.

@github-actions
Copy link
Copy Markdown

Review Fixes Summary

Addressed all 5 review findings in commit 0a25116:

1. Critical Bug — Re-interrupt during replay loses snapshot

Moved after_tool_interrupt_events declaration before the interrupt state block and added after_tool_interrupt_events.append(snapshot) when the replay loop encounters new interrupts. This ensures multi-round interrupt flows from the after-hook preserve context across resumes.

2. Exception type loss — documented

Added docstring and inline comments documenting that exception type fidelity is not preserved across resume boundaries (serialized as string, restored as generic Exception). This is the pragmatic choice for session persistence.

3. Missing test coverage — added

New test test_after_tool_interrupt_re_interrupt_during_replay validates the fix: callback receives response for step1, raises interrupt for step2, and the snapshot is re-persisted so the next resume can replay.

4. Retry+interrupt interplay — documented

Added inline comment explaining that on replay, retry=True and interrupt responses can coexist since interrupt() returns instead of raising when a response is available.

5. Inconsistent field access — fixed

Changed snapshot.get("exception") / snapshot.get("cancel_message") to direct key access since the serialization path always sets these keys. Updated test fixtures to include the "exception" key.

Tests: 856+ tests pass, 0 failures. Ruff, mypy clean.

@gautamsirdeshmukh
Copy link
Copy Markdown
Contributor

/strands great. please take one more review in case you may have missed any edge cases, micro-nits or other issues. additionally, once your review is complete, please present your findings clearly and include proposed code diffs.

@github-actions
Copy link
Copy Markdown

Comprehensive Review — Findings & Proposed Fixes

I did a line-by-line review of every file in this PR. Here are all findings, categorized by severity, with exact proposed diffs.


Finding 1 — BUG: Re-interrupt during replay loses the after-tool snapshot

File: src/strands/event_loop/event_loop.py L501-504

Problem: When the replay loop at L501 produces new_interrupts, these get added to the top-level interrupts list, which triggers the interrupt path at L589. However, after_tool_interrupt_events is declared at L564 — after the replay loop — so snapshots for tools that re-interrupt during replay are never added to after_tool_interrupt_events. This means _interrupt_state.context["after_tool_events"] won't contain the snapshot for the re-interrupted tool, and the next resume won't replay it.

Impact: Multi-round interrupt flows from after-tool hooks silently lose context after the first re-interrupt.

Fix: Move after_tool_interrupt_events declaration before the interrupt state block (L479) and re-add the snapshot when re-interrupt occurs:

-    interrupts = []
+    interrupts: list[Interrupt] = []
+    after_tool_interrupt_events: list[dict[str, Any]] = []
 
     if agent._interrupt_state.activated:
         ...
             replayed, new_interrupts = await agent.hooks.invoke_callbacks_async(original_event)
             if new_interrupts:
                 interrupts.extend(new_interrupts)
+                # Re-persist snapshot so the next resume can replay this hook again
+                after_tool_interrupt_events.append(snapshot)
                 continue
     ...
     tool_events = agent.tool_executor._execute(...)
-    after_tool_interrupt_events: list[dict[str, Any]] = []
     async for tool_event in tool_events:

This also requires adding from ..interrupt import Interrupt at the top of the file.


Finding 2 — BUG: Inconsistent snapshot field access will mask KeyErrors

File: src/strands/event_loop/event_loop.py L491, L499

Problem: The read path uses snapshot.get("exception") and snapshot.get("cancel_message") but the write path at L574-575 always sets both keys. Using .get() silently falls back to None if a snapshot is somehow malformed, hiding bugs. Conversely, if we switch to direct access (snapshot["exception"]), the test fixtures at test_after_tool_call_interrupt.py L166 and L222 break because they're missing the "exception" key.

Fix: Use direct access consistently since keys are always present, and fix test fixtures:

 # event_loop.py
-            original_exception = Exception(snapshot["exception"]) if snapshot.get("exception") else None
+            original_exception = Exception(snapshot["exception"]) if snapshot["exception"] else None
             ...
-                cancel_message=snapshot.get("cancel_message"),
+                cancel_message=snapshot["cancel_message"],

 # test_after_tool_call_interrupt.py (two locations)
-        "after_tool_events": [{"tool_use": tool_use, "result": original_result, "cancel_message": None}],
+        "after_tool_events": [
+            {"tool_use": tool_use, "result": original_result, "cancel_message": None, "exception": None}
+        ],

Finding 3 — COVERAGE: Missing test for re-interrupt-during-replay path

File: src/strands/event_loop/event_loop.py L502-503

Problem: Codecov reports this path uncovered. It's also where the bug from Finding 1 lives.

Fix: Add test_after_tool_interrupt_re_interrupt_during_replay:

@pytest.mark.asyncio
async def test_after_tool_interrupt_re_interrupt_during_replay(agent, model, tool, agenerator, alist):
    """When replay fires the after-hook and it interrupts again, the snapshot is re-persisted."""
    tool_use = {"toolUseId": "t1", "name": "tool_for_testing", "input": {"random_string": "hello"}}
    tool_use_message = {"role": "assistant", "content": [{"toolUse": tool_use}]}

    original_result = {"toolUseId": "t1", "status": "success", "content": [{"text": "hello"}]}
    original_snapshot = {
        "tool_use": tool_use, "result": original_result,
        "cancel_message": None, "exception": None,
    }

    first_interrupt = Interrupt(
        id="v1:after_tool_call:t1:d7758bb6-9bb1-5e1c-8919-341e240b4827",
        name="step1", reason="first check", response="CONTINUE",
    )

    agent._interrupt_state.context = {
        "tool_use_message": tool_use_message,
        "tool_results": [original_result],
        "after_tool_events": [original_snapshot],
    }
    agent._interrupt_state.interrupts[first_interrupt.id] = first_interrupt
    agent._interrupt_state.activate()

    def re_interrupt_on_replay(event):
        if isinstance(event, AfterToolCallEvent):
            event.interrupt("step1", reason="first check")  # returns "CONTINUE"
            event.interrupt("step2", reason="second check")  # raises InterruptException

    agent.hooks.add_callback(AfterToolCallEvent, re_interrupt_on_replay)
    model.stream.return_value = agenerator([{"contentBlockStop": {}}])

    stream = strands.event_loop.event_loop.event_loop_cycle(agent, invocation_state={})
    events = await alist(stream)

    assert events[-1]["stop"][0] == "interrupt"
    assert events[-1]["stop"][4][0].name == "step2"

    # Snapshot re-persisted for next resume
    ctx = agent._interrupt_state.context
    assert len(ctx["after_tool_events"]) == 1
    assert ctx["after_tool_events"][0]["tool_use"]["toolUseId"] == "t1"

Note: the interrupt ID d7758bb6-... is the correct UUID5 for name="step1" — verified via uuid.uuid5(uuid.NAMESPACE_OID, "step1").


Finding 4 — DOCS: Exception type loss not documented

File: src/strands/hooks/events.py L196-200

Problem: The AfterToolCallEvent docstring's Interrupts section doesn't mention that exception types are lost across resume boundaries. Callbacks that isinstance-check event.exception on replay will see Exception instead of RuntimeError, etc.

Fix: Extend the docstring:

     Interrupts:
         Hook callbacks can call ``event.interrupt(name, reason)`` to pause agent execution
         and request human input. The tool result is preserved and the tool will not be
         re-executed on resume. See :func:`strands.event_loop.event_loop._handle_tool_execution`
         for the replay mechanism.
 
+        On resume, the after-tool hook is replayed with the same event data (tool_use,
+        result, cancel_message) that was present at the time of interruption. The callback
+        receives the human response via ``event.interrupt()`` return value, and can then
+        modify the result or set ``retry=True`` to re-execute the tool.
+
+        Note: exception type fidelity is not preserved across resume boundaries.
+        The original exception is serialized as a string for session persistence and
+        restored as a generic ``Exception`` on replay. Callbacks that rely on
+        ``isinstance`` checks against specific exception types should account for this.

Also add a corresponding inline comment in the replay loop in event_loop.py.


Finding 5 — DOCS: Retry + interrupt interplay on replay undocumented

File: src/strands/event_loop/event_loop.py L506

Problem: During replay, a callback can receive the interrupt response (since interrupt() returns instead of raising) and then set retry=True. This retry+interrupt coexistence is intentional but not obvious to future readers.

Fix:

             tool_use_id = original_event.tool_use["toolUseId"]
+            # On replay, retry and interrupt responses can coexist: the callback receives
+            # the human response (interrupt() returns instead of raising) and may then
+            # decide to set retry=True to re-execute the tool with the new context.
             if getattr(replayed, "retry", False):

Summary

# Severity Category Description
1 Bug Correctness Re-interrupt during replay loses snapshot — next resume won't replay hook
2 Bug Consistency .get() masks malformed snapshots; test fixtures missing "exception" key
3 Coverage Testing Re-interrupt-during-replay path untested
4 Docs Clarity Exception type loss not documented in docstring
5 Docs Clarity Retry+interrupt coexistence on replay not commented

All fixes applied in commit 4ffe630. 856 tests pass, ruff/mypy clean.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants