You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
chore(webapp,run-engine): downgrade boundary log noise to warn (#3462)
## Summary
Several boundary catches and customer-input validation paths were
logging at `error` level for failures the system already handles
gracefully — disconnect on auth failure, return undefined, skip retries,
etc. This batch routes them to `warn` (which stays in stdout) or counts
them as OTel metrics, so visibility is preserved without surfacing them
as alerts.
## Changes
**New helper / pattern:**
- `apiBuilder.server.ts` — `logBoundaryError(message, error, url)`
inspects the inner error type at loader/action boundary catches;
downgrades to `warn` for `AbortError`, `ServiceValidationError`, and
`EngineServiceValidationError`.
- `platform.v3.server.ts` — `platform_client.failures_total` OTel
counter with `{function, kind}` labels; helper
`recordPlatformFailure(fn, kind)` replaces the previous error-level
logging across all `BillingClient` wrappers.
**Log-level downgrades:**
- `handleSocketIo.server.ts` — `Worker authentication failed` → warn
(system disconnects on failure; refs TRI-8863)
- `waitpointSystem.ts` — when `runStatus === "CANCELED"` in the
suspended-without-checkpoint branch, skip the throw and warn instead
(benign cancel-vs-resume race, nothing to resume)
- `runAttemptSystem.ts` — `flushedMetadata` parse/validate failures →
warn (customer-side data shape, system returns gracefully)
- `batch-queue/index.ts` — final-attempt failures with
`result.skipRetries` → warn (callbacks already opted out of retry, e.g.
queue size limit hit)
- `queryPerformanceMonitor.server.ts` — slow queries → warn
(observability signal, not an application error)
- `timeoutDeployment.server.ts` — deployment-state mismatch in the
timeout job → warn (timeout-vs-completion race)
**Inner error preservation:**
- `waitpointCompletionPacket.server.ts` — `logger.error(uploadError)`
before throwing the `ServiceValidationError` wrapper, so the underlying
upload error stays visible.
## Why
The pattern across all of these is the same: a boundary log treated any
thrown/returned error as `error` regardless of cause, even when the
cause was an expected, system-handled condition (client disconnect,
customer quota, race condition, schema validation of customer data).
That made the logs noisy and made it harder to spot real bugs.
Where the underlying signal is still useful operationally (slow queries,
billing call failures), we route it to OTel metrics with low-cardinality
labels so dashboards and alerts can be tuned independently of error
logs.
## Test plan
- [ ] `pnpm run typecheck --filter webapp`
- [ ] `pnpm run build --filter @internal/run-engine`
- [ ] Trigger a run on hello-world and verify task lifecycle is
unaffected
- [ ] Cancel a suspended run and verify the cancel-while-suspended
branch in `waitpointSystem.ts` returns `{status: "skipped"}` instead of
throwing
- [ ] Confirm `platform_client.failures_total` counter shows up in
metrics with `{function, kind}` labels when the billing client errors
0 commit comments