
fix: make local mode health-check timeout configurable via session config #5778

Open
egorkosaretsky wants to merge 1 commit into aws:master from egorkosaretsky:fix-local-mode-health-check-timeout

Conversation

@egorkosaretsky

Problem

Local mode endpoint deployment uses a hard-coded HEALTH_CHECK_TIMEOUT_LIMIT of 120s (entities.py:43). Deployment fails whenever the container takes longer than that to start (for example with large model archives of roughly 5 GB or more, slow pip installs, or low-bandwidth environments), even though the same deployment to an actual SageMaker endpoint succeeds.

Fixes #3362.

Solution

  • Added an optional timeout parameter to _wait_for_serving_container() (defaults to HEALTH_CHECK_TIMEOUT_LIMIT for backwards compatibility)
  • Both _LocalEndpoint.serve() and _LocalTransformJob.start() now read local.health_check_timeout from the session config and pass it through
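
The first bullet can be illustrated with a minimal, self-contained sketch of a wait loop whose timeout is an optional argument. This is not the SDK's implementation: `wait_for_serving_container`, `ping`, `interval`, `clock` and `sleep` are hypothetical names chosen for illustration, and only the 120s default comes from the description above.

```python
import time

HEALTH_CHECK_TIMEOUT_LIMIT = 120  # seconds; the default named in the PR description


def wait_for_serving_container(ping, timeout=HEALTH_CHECK_TIMEOUT_LIMIT,
                               interval=5, clock=time.monotonic, sleep=time.sleep):
    """Poll `ping()` until it returns True or `timeout` seconds have elapsed.

    Hypothetical stand-in for the SDK helper. `ping`, `interval`, `clock`
    and `sleep` are illustrative parameters, not the SDK's real signature.
    """
    deadline = clock() + timeout
    while True:
        if ping():
            return
        if clock() >= deadline:
            raise RuntimeError(
                f"serving container did not become healthy within {timeout}s"
            )
        sleep(interval)
```

Callers that omit `timeout` keep the 120s behaviour, which is what makes an optional parameter of this kind backwards compatible.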

Usage:

sess = LocalSession()
sess.config = {'local': {'health_check_timeout': 600}}  # 10 minutes
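
For readers wondering how a nested key such as `local.health_check_timeout` might be resolved with a fallback, here is a rough sketch. The helper name `resolve_health_check_timeout` and the constant are hypothetical. The actual lookup in this PR may be written differently.

```python
DEFAULT_HEALTH_CHECK_TIMEOUT = 120  # seconds; the default described in this PR


def resolve_health_check_timeout(config):
    """Return config['local']['health_check_timeout'] if present.

    Falls back to the default when the config is None, has no 'local'
    section, or the section lacks the key. Hypothetical helper, not SDK code.
    """
    local_cfg = (config or {}).get("local") or {}
    return local_cfg.get("health_check_timeout", DEFAULT_HEALTH_CHECK_TIMEOUT)
```

With the usage above, `resolve_health_check_timeout(sess.config)` would yield 600, while a session with no config would yield 120.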

Testing

Added unit tests covering:

  • Custom timeout value is respected by _wait_for_serving_container
  • health_check_timeout from session config is passed through in _LocalEndpoint.serve()
  • Default behaviour (120s) is unchanged when config is not set
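
The shape of such tests can be sketched with `unittest.mock`. The `serve` function below is a simplified, hypothetical stand-in for the endpoint code path, not the SDK's `_LocalEndpoint.serve()`. It exists only so the example is self-contained.

```python
import unittest
from unittest import mock

DEFAULT_TIMEOUT = 120  # seconds; the default described in this PR


def serve(config, wait):
    """Hypothetical stand-in: read the timeout from config and pass it to `wait`."""
    local_cfg = (config or {}).get("local") or {}
    wait(timeout=local_cfg.get("health_check_timeout", DEFAULT_TIMEOUT))


class HealthCheckTimeoutTest(unittest.TestCase):
    def test_custom_timeout_is_passed_through(self):
        wait = mock.Mock()
        serve({"local": {"health_check_timeout": 600}}, wait)
        wait.assert_called_once_with(timeout=600)

    def test_default_used_when_config_not_set(self):
        wait = mock.Mock()
        serve(None, wait)
        wait.assert_called_once_with(timeout=DEFAULT_TIMEOUT)
```

Mocking the wait function keeps the tests fast, since no container has to start and no real time has to pass.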

Checklist

  • Unit tests added/updated
  • Backwards compatible (default unchanged)
  • Both endpoint and transform job code paths updated

…nfig (aws#3362)

Allow users to set `local.health_check_timeout` in their session config to override
the hard-coded 120s limit, fixing failures when large model archives or slow networks
cause container startup to exceed the default.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@egorkosaretsky
Author

@nargokul Could you please take a look? This should fix #3362.


Development

Successfully merging this pull request may close these issues.

Configurable (or just much longer?) health-check timeout in local mode
