
chore(webapp,run-engine): downgrade boundary log noise to warn #3462

Merged
ericallam merged 2 commits into main from chore/sentry-audit-noise-reduction on Apr 29, 2026


Conversation

@ericallam
Member

Summary

Several boundary catches and customer-input validation paths were logging at error level for failures the system already handles gracefully — disconnect on auth failure, return undefined, skip retries, etc. This batch routes them to warn (which stays in stdout) or counts them as OTel metrics, so visibility is preserved without surfacing them as alerts.

Changes

New helper / pattern:

  • apiBuilder.server.ts: logBoundaryError(message, error, url) inspects the inner error type at loader/action boundary catches and downgrades to warn for AbortError, ServiceValidationError, and EngineServiceValidationError.
  • platform.v3.server.ts: platform_client.failures_total OTel counter with {function, kind} labels; the helper recordPlatformFailure(fn, kind) replaces the previous error-level logging across all BillingClient wrappers.
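A minimal TypeScript sketch of the boundary-classification idea, assuming the helper only needs the error's `name` to pick a level. The in-memory `logger`, the `logged` array, and the stand-in `ServiceValidationError` class are hypothetical and exist only so the sketch runs on its own; the real helper in apiBuilder.server.ts may check types with `instanceof` and log through the webapp's logger.

```typescript
type LogLevel = "warn" | "error";
type LogEntry = { level: LogLevel; message: string; meta: Record<string, unknown> };

// Hypothetical in-memory logger so the chosen level is observable in this sketch.
const logged: LogEntry[] = [];
const logger = {
  warn(message: string, meta: Record<string, unknown>): void {
    logged.push({ level: "warn", message, meta });
  },
  error(message: string, meta: Record<string, unknown>): void {
    logged.push({ level: "error", message, meta });
  },
};

// Stand-in for the real ServiceValidationError; only its name matters here.
class ServiceValidationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ServiceValidationError";
  }
}

// Errors that represent expected, system-handled outcomes at a loader/action boundary.
const EXPECTED_ERROR_NAMES = new Set([
  "AbortError", // client disconnected / request aborted
  "ServiceValidationError",
  "EngineServiceValidationError",
]);

function logBoundaryError(message: string, error: unknown, url: string): void {
  const meta = {
    url,
    errorName: error instanceof Error ? error.name : typeof error,
    errorMessage: error instanceof Error ? error.message : String(error),
  };

  if (error instanceof Error && EXPECTED_ERROR_NAMES.has(error.name)) {
    logger.warn(message, meta); // still in stdout, but not an alert
  } else {
    logger.error(message, meta); // unexpected failures keep error level
  }
}
```

In a loader or action, the catch block would call something like `logBoundaryError("Error in loader", error, request.url)` and then return its usual error response; only the log level changes, not the response.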

Log-level downgrades:

  • handleSocketIo.server.ts: "Worker authentication failed" → warn (the system disconnects on failure; refs TRI-8863)
  • waitpointSystem.ts — when runStatus === "CANCELED" in the suspended-without-checkpoint branch, skip the throw and warn instead (benign cancel-vs-resume race, nothing to resume)
  • runAttemptSystem.ts: flushedMetadata parse/validate failures → warn (customer-side data shape; the system returns gracefully)
  • batch-queue/index.ts — final-attempt failures with result.skipRetries → warn (callbacks already opted out of retry, e.g. queue size limit hit)
  • queryPerformanceMonitor.server.ts — slow queries → warn (observability signal, not an application error)
  • timeoutDeployment.server.ts — deployment-state mismatch in the timeout job → warn (timeout-vs-completion race)
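The waitpoint change can be sketched as follows, under simplified assumptions: `handleSuspendedWithoutCheckpoint`, the `warnings` array, and the error message are hypothetical, and the real branch in waitpointSystem.ts carries more context. It shows only the shape of the fix: a CANCELED run short-circuits to `{ status: "skipped" }` with a warning, while any other status still throws as the branch did before.

```typescript
type SkippedResult = { status: "skipped" };

// Hypothetical warning sink standing in for the run-engine logger.
const warnings: string[] = [];

// Sketch of the suspended-without-checkpoint branch (assumed to have always thrown before).
function handleSuspendedWithoutCheckpoint(runId: string, runStatus: string): SkippedResult {
  if (runStatus === "CANCELED") {
    // Benign cancel-vs-resume race: the run was canceled while suspended, so nothing is left to resume.
    warnings.push(`Run ${runId} was canceled while suspended; skipping resume`);
    return { status: "skipped" };
  }

  // Any other status in this branch is still unexpected and keeps the original throw.
  throw new Error(`Run ${runId} is suspended without a checkpoint (status: ${runStatus})`);
}
```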

Inner error preservation:

  • waitpointCompletionPacket.server.ts: logger.error(uploadError) before throwing the ServiceValidationError wrapper, so the underlying upload error stays visible.
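A sketch of the log-then-wrap pattern, assuming an injected `upload` function in place of the real object-store client. `uploadCompletionPacket`, `Uploader`, `errorLogs`, and the stand-in `ServiceValidationError` are hypothetical names used only for illustration.

```typescript
// Stand-in for the real ServiceValidationError (the user-facing wrapper).
class ServiceValidationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ServiceValidationError";
  }
}

// Hypothetical sink standing in for logger.error.
const errorLogs: Array<{ message: string; error: unknown }> = [];
const logger = {
  error(message: string, meta: { error: unknown }): void {
    errorLogs.push({ message, error: meta.error });
  },
};

// Hypothetical uploader; the real object-store client is not shown in the PR.
type Uploader = (key: string, body: string) => Promise<void>;

async function uploadCompletionPacket(upload: Uploader, key: string, body: string): Promise<string> {
  try {
    await upload(key, body);
  } catch (uploadError) {
    // Log the underlying cause first; the wrapper below only carries a generic message.
    logger.error("Failed to upload waitpoint completion packet", { error: uploadError });
    throw new ServiceValidationError("Failed to upload output");
  }
  return key;
}
```

Attaching the original error through the standard `cause` option (`new Error(message, { cause })`, ES2022) is an alternative, but it only helps if the downstream logger serializes `cause`; logging first keeps the inner error visible either way.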

Why

The pattern across all of these is the same: a boundary log recorded any thrown or returned error at error level regardless of cause, even when the cause was an expected, system-handled condition (client disconnect, customer quota, race condition, schema validation of customer data). That made the logs noisy and real bugs harder to spot.

Where the underlying signal is still useful operationally (slow queries, billing call failures), we route it to OTel metrics with low-cardinality labels so dashboards and alerts can be tuned independently of error logs.
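One way to express the counter-plus-helper pattern, assuming a closed set of failure kinds. The `InMemoryCounter` class stands in for an OpenTelemetry counter (in production the counter would come from `metrics.getMeter(...).createCounter("platform_client.failures_total")` in `@opentelemetry/api`) so the sketch runs without dependencies. The `PlatformFailureKind` values and the wrapper names used in the checks are hypothetical.

```typescript
// Minimal counter shape mirroring OpenTelemetry's Counter.add(value, attributes).
type Attributes = Record<string, string>;
interface MetricCounter {
  add(value: number, attributes?: Attributes): void;
}

// In-memory stand-in for a real OTel counter: one running total per label set.
class InMemoryCounter implements MetricCounter {
  readonly totals = new Map<string, number>();

  constructor(readonly name: string) {}

  add(value: number, attributes: Attributes = {}): void {
    // Key each series by its sorted labels, the way a metrics backend separates time series.
    const key = Object.keys(attributes)
      .sort()
      .map((k) => `${k}=${attributes[k]}`)
      .join(",");
    this.totals.set(key, (this.totals.get(key) ?? 0) + value);
  }
}

// Hypothetical closed set of failure kinds: a fixed union keeps label cardinality bounded.
type PlatformFailureKind = "http_error" | "network_error" | "invalid_response";

const platformClientFailures = new InMemoryCounter("platform_client.failures_total");

// `fn` is the name of the calling BillingClient wrapper, so it also comes from a small fixed set.
function recordPlatformFailure(fn: string, kind: PlatformFailureKind): void {
  platformClientFailures.add(1, { function: fn, kind });
}
```

Keeping `kind` a closed union and `function` limited to wrapper names bounds the number of time series; per-request values such as org IDs or error messages would not.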

Test plan

  • pnpm run typecheck --filter webapp
  • pnpm run build --filter @internal/run-engine
  • Trigger a run on hello-world and verify task lifecycle is unaffected
  • Cancel a suspended run and verify the cancel-while-suspended branch in waitpointSystem.ts returns {status: "skipped"} instead of throwing
  • Confirm platform_client.failures_total counter shows up in metrics with {function, kind} labels when the billing client errors

@changeset-bot

changeset-bot Bot commented Apr 28, 2026

⚠️ No Changeset found

Latest commit: bdd5a5a

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@coderabbitai
Contributor

coderabbitai Bot commented Apr 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 67daf9d4-cce5-4d37-a6bb-18e535f0814e

📥 Commits

Reviewing files that changed from the base of the PR and between bdd5a5a and a5bc3ce.

📒 Files selected for processing (2)
  • apps/webapp/app/services/platform.v3.server.ts
  • internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts
  • apps/webapp/app/services/platform.v3.server.ts
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms (28 checks). You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms).
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: units / e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: typecheck / typecheck
  • GitHub Check: sdk-compat / Cloudflare Workers

Walkthrough

The diff standardizes observability and error classification across the codebase: adds structured logging for object-store upload failures; introduces an OpenTelemetry meter and a platform_client.failures_total counter with a helper to record platform client failures; centralizes boundary-error formatting via logBoundaryError; downgrades several error logs to warnings for expected/racy conditions and slow queries; updates Sentry to ignore ServiceValidationError; adjusts batch/run-engine failure logging to distinguish non-retryable cases; and adds a Sentry filter verification script.
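As a hedged illustration of the Sentry change mentioned in the walkthrough, here is a `beforeSend`-style filter that drops events whose original exception is a ServiceValidationError. The local `SentryEvent` and `EventHint` types are simplified stand-ins for the SDK's own types, and matching on the error's `name` is an assumption; the PR's actual filter is not shown here.

```typescript
// Simplified stand-ins for the parts of Sentry's beforeSend(event, hint) hook used here.
type SentryEvent = { message?: string };
type EventHint = { originalException?: unknown };

// Error names treated as expected, customer-facing outcomes rather than bugs.
const IGNORED_ERROR_NAMES = new Set(["ServiceValidationError"]);

// Returning null drops the event; returning the event sends it as usual.
function beforeSend(event: SentryEvent, hint: EventHint): SentryEvent | null {
  const original = hint.originalException;
  if (original instanceof Error && IGNORED_ERROR_NAMES.has(original.name)) {
    return null;
  }
  return event;
}
```

Passed as the `beforeSend` option to `Sentry.init`, a `null` return discards the event; a verification script could call the same function with sample errors.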

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 9.09%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check: ✅ Passed. The title clearly summarizes the main change: downgrading boundary log noise from error to warn level across multiple files.
  • Description check: ✅ Passed. The description is comprehensive and complete, covering summary, changes, rationale, and test plan matching all required template sections.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


Several boundary log sites and customer-input validation paths were
unconditionally logging at error level for failures the system already
handles gracefully (disconnect, retry-skip, return early). This batch
downgrades them to warn or counts them as metrics — visibility is
preserved in stdout / OTel metrics without surfacing them as alerts.

- apiBuilder.server.ts: logBoundaryError helper — warn for AbortError
  and ServiceValidationError at loader/action boundary catches
- handleSocketIo.server.ts: warn for "Worker authentication failed"
  (system disconnects on auth failure; refs TRI-8863)
- waitpointSystem.ts: skip throw and warn when run was canceled
  while suspended (benign cancel-vs-resume race, no checkpoint to
  resume from)
- runAttemptSystem.ts: warn for failed parse/validate of customer's
  flushedMetadata (system already returns gracefully)
- batch-queue/index.ts: warn for non-retryable batch item failures
  via result.skipRetries (queue size limit exceeded, etc.)
- queryPerformanceMonitor.server.ts: slow queries are observability,
  not errors — warn
- timeoutDeployment.server.ts: deployment-state mismatch is a benign
  timeout-vs-completion race — warn
- platform.v3.server.ts: platform_client.failures_total OTel counter
  with {function, kind} labels replaces logger.error from BillingClient
  call sites; helper recordPlatformFailure(fn, kind)
- waitpointCompletionPacket.server.ts: log inner uploadError before
  throwing the wrapper ServiceValidationError so the underlying error
  context isn't lost
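The batch-queue rule from this commit can be sketched as a retry loop in which a failed result may set `skipRetries`. `processWithRetries`, `ProcessResult`, and the `logs` array are hypothetical; the real batch-queue result type and logger are not shown in the PR. The sketch only illustrates the level choice: a final failure that opted out of retries logs at warn, while one that exhausted its attempts still logs at error.

```typescript
// Hypothetical result shape; the real batch-queue types are not shown in the PR.
type ProcessResult =
  | { success: true }
  | { success: false; error: string; skipRetries?: boolean };

type LogEntry = { level: "warn" | "error"; message: string };
const logs: LogEntry[] = []; // stand-in for the batch-queue logger

// Runs `attemptFn` up to maxAttempts times; a failure with skipRetries ends the loop early.
function processWithRetries(
  itemId: string,
  attemptFn: (attempt: number) => ProcessResult,
  maxAttempts: number
): ProcessResult {
  let result: ProcessResult = { success: false, error: "not attempted" };

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    result = attemptFn(attempt);
    if (result.success) return result;

    const isFinalAttempt = attempt === maxAttempts || result.skipRetries === true;
    if (!isFinalAttempt) continue; // retry

    if (result.skipRetries) {
      // The callback opted out of retries (e.g. queue size limit hit): expected, so warn.
      logs.push({ level: "warn", message: `Batch item ${itemId} failed without retry: ${result.error}` });
    } else {
      // Retries were exhausted on a retryable failure: still an error.
      logs.push({ level: "error", message: `Batch item ${itemId} failed after ${attempt} attempts: ${result.error}` });
    }
    return result;
  }

  return result;
}
```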
@ericallam force-pushed the chore/sentry-audit-noise-reduction branch from 644147b to bdd5a5a on April 28, 2026 15:41
@ericallam marked this pull request as ready for review on April 28, 2026 15:42
coderabbitai[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

- runAttemptSystem.ts: remove unreachable `if (packetError)` block.
  After `if (!packet) { return; }`, packet is truthy, which means
  packetError is null per tryCatch's contract. Pre-existing dead code.
- platform.v3.server.ts: fix stale comment that referenced logger.warn —
  the helper only counts a metric (no per-call logging by design).
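A sketch of why the removed `if (packetError)` block was unreachable, assuming tryCatch resolves to an `[error, value]` tuple whose value slot is null on failure and whose error slot is null on success. This `tryCatch`, `readFlushedMetadata`, and the `warnings` array are hypothetical re-creations for illustration, not the repository's code.

```typescript
// Hypothetical tryCatch: on success the error slot is null; on failure the value slot is null.
async function tryCatch<T>(promise: Promise<T>): Promise<[null, T] | [unknown, null]> {
  try {
    return [null, await promise];
  } catch (error) {
    return [error, null];
  }
}

const warnings: string[] = []; // stand-in for logger.warn

// Hypothetical reader with the control flow described in the commit message.
async function readFlushedMetadata(load: () => Promise<string>): Promise<string | undefined> {
  const [packetError, packet] = await tryCatch(load());

  if (!packet) {
    // Reached on failure (packet is null) and on an empty success value.
    if (packetError) {
      warnings.push(`Failed to read flushed metadata: ${String(packetError)}`);
    }
    return undefined;
  }

  // packet is truthy here, so this came from the success tuple and packetError is null:
  // an `if (packetError)` check at this point could never run.
  return packet;
}
```

Because the failure tuple always carries `null` in the value slot, any path that gets past `if (!packet)` must have come from the success tuple.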
Contributor

@devin-ai-integration Bot left a comment


Devin Review found 1 new potential issue.

View 5 additional findings in Devin Review.


Comment thread apps/webapp/app/services/platform.v3.server.ts
@ericallam merged commit dac9c83 into main on Apr 29, 2026
46 checks passed
@ericallam deleted the chore/sentry-audit-noise-reduction branch on April 29, 2026 09:00

Labels

None yet

Projects

None yet


2 participants