feat: add CodeArtifact support for ModelTrainer and FrameworkProcessor requirements.txt installation #5772
Open
humanzz wants to merge 1 commit into aws:master
SDK v3's `ModelTrainer` and `FrameworkProcessor` override the container entrypoint with SDK-generated scripts (`sm_train.sh`, `runproc.sh`). This bypasses the container's own entrypoint, in which `sagemaker-training-toolkit` handled `CA_REPOSITORY_ARN`-based CodeArtifact authentication. As a result, CodeArtifact support broke for both training (Bug 4) and processing (Bug 3), as reported in aws#5765.

This is the stopgap solution proposed in this comment [aws#5765 (comment)]: a self-contained `install_requirements.py` script that the SDK uploads to the container alongside its generated entrypoint scripts.

- Add `install_requirements.py` in sagemaker-core: reads `CA_REPOSITORY_ARN` from the container environment; no-op if unset
- Try `boto3` first (matching sagemaker-training-toolkit), fall back to the AWS CLI, hard-fail if neither is available
- Wire into `ModelTrainer`: copy script into `sm_drivers/scripts/`, update `INSTALL_REQUIREMENTS` templates to call it instead of bare `pip install`
- Wire into `FrameworkProcessor`: upload script as sibling file alongside `runproc.sh`, update generated script to call it
I've also tested this with my code (as an integration test) to verify the behaviours Training Job (
aviruthen approved these changes on Apr 17, 2026
zhaoqizqwang approved these changes on Apr 17, 2026
Issue #, if available:
#5765
Description of changes:
When the SDK overrides a container's entrypoint, as `ModelTrainer` does for training jobs (Bug 4) and `FrameworkProcessor` does for processing jobs (Bug 3), the container's native `sagemaker-training-toolkit` is bypassed. This toolkit handled `CA_REPOSITORY_ARN`-based CodeArtifact authentication for `requirements.txt` installation via boto3. Without it, `pip install -r requirements.txt` runs against public PyPI, failing in VPC-isolated environments or when packages are only available in a private CodeArtifact repository.

See #5765 and the detailed analysis comment for full context.
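To make the toolkit's role concrete, here is a minimal sketch of how a `CA_REPOSITORY_ARN` value could be turned into an authenticated pip index URL. It is an illustration only, not the toolkit's or this PR's code. The helper names `parse_repository_arn` and `build_index_url` are hypothetical, the token is assumed to come from boto3's CodeArtifact `get_authorization_token` call, and the URL shape assumes the standard `aws` partition.

```python
import re
from typing import NamedTuple

# Assumed ARN layout:
# arn:<partition>:codeartifact:<region>:<account>:repository/<domain>/<repository>
_ARN_PATTERN = re.compile(
    r"^arn:[^:]+:codeartifact:(?P<region>[^:]+):(?P<account>\d{12})"
    r":repository/(?P<domain>[^/]+)/(?P<repository>[^/]+)$"
)


class CodeArtifactRepo(NamedTuple):
    """Pieces of a repository ARN needed to address its PyPI endpoint."""

    region: str
    account: str
    domain: str
    repository: str


def parse_repository_arn(arn: str) -> CodeArtifactRepo:
    """Split a CodeArtifact repository ARN into its components (hypothetical helper)."""
    match = _ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(f"Not a CodeArtifact repository ARN: {arn!r}")
    return CodeArtifactRepo(
        match["region"], match["account"], match["domain"], match["repository"]
    )


def build_index_url(repo: CodeArtifactRepo, token: str) -> str:
    """Build a pip index URL that embeds a short-lived authorization token.

    In a real flow the token would be fetched first, for example with
    boto3.client("codeartifact", region_name=repo.region)
        .get_authorization_token(domain=repo.domain, domainOwner=repo.account)
    and read from the "authorizationToken" field of the response.
    """
    return (
        f"https://aws:{token}@{repo.domain}-{repo.account}"
        f".d.codeartifact.{repo.region}.amazonaws.com/pypi/{repo.repository}/simple/"
    )
```

The resulting URL could then be passed to pip, for instance through `pip install --index-url <url> -r requirements.txt`. Keeping the parsing and URL construction free of network calls makes them easy to unit test.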
Solution: Stopgap `install_requirements.py`

A self-contained Python script in `sagemaker-core` that handles CodeArtifact authentication before installing requirements. It:

- Reads `CA_REPOSITORY_ARN` from the container environment; if not set, does a normal `pip install`
- Tries boto3 first (matching `sagemaker-training-toolkit`'s approach) to build an authenticated pip index URL
- Falls back to the AWS CLI (`aws codeartifact login --tool pip`) if boto3 is unavailable
- Hard-fails if `CA_REPOSITORY_ARN` is set but neither boto3 nor AWS CLI is available

The script can be used as:
- `python install_requirements.py requirements.txt` (used by bash-based entrypoints)
- `from sagemaker.core.utils.install_requirements import configure_pip, install_requirements` (for Python-native callers like `@remote` or `ModelBuilder`)

Changes
- `sagemaker-core/.../utils/install_requirements.py`: `configure_pip()`, `install_requirements()`, `main()`, and `CodeArtifactAuthMethod` enum
- `sagemaker-core/tests/unit/test_install_requirements.py`
- `sagemaker-train/.../templates.py`: `INSTALL_REQUIREMENTS` and `INSTALL_AUTO_REQUIREMENTS` now call `install_requirements.py` instead of bare `pip install`
- `sagemaker-train/.../model_trainer.py`: copies `install_requirements.py` from sagemaker-core into `sm_drivers/scripts/` at runtime
- `sagemaker-core/.../processing.py`: uploads `install_requirements.py` as sibling file alongside `runproc.sh` and `sourcedir.tar.gz`; updates generated script to call it
- `sagemaker-core/tests/unit/test_processing.py`: `install_requirements.py` is uploaded and referenced in generated script

What this covers
- `ModelTrainer`: `install_requirements.py` in `sm_drivers/scripts/`
- `FrameworkProcessor`: `install_requirements.py` uploaded as sibling file
- `Tuner`
- `ModelBuilder`

What this does NOT cover
- `@remote` function (`runtime_environment_manager.py`): `_install_requirements_txt()` that does bare `pip install`. Could use `configure_pip()` via import.
- `sagemaker-serve` (`requirements_manager.py`): `pip install` in-process. Could import `configure_pip()`.
- `sagemaker-core/modules` (`templates.py`): `sagemaker-train/templates.py` without `INSTALL_REQUIREMENTS`. Lower priority.

These are follow-up opportunities: the module is available for them to import.
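For reviewers and follow-up callers, the resolution order described in the Solution section (no-op when `CA_REPOSITORY_ARN` is unset, boto3 first, AWS CLI second, hard-fail otherwise) can be sketched roughly as below. This is not the code in `install_requirements.py`: `AuthMethod`, `choose_auth_method`, and the injectable availability checks are hypothetical names used only for illustration, and the members of the PR's actual `CodeArtifactAuthMethod` enum are not shown here.

```python
import enum
import importlib.util
import os
import shutil
from typing import Callable, Mapping, Optional


class AuthMethod(enum.Enum):
    """Hypothetical stand-in for the PR's CodeArtifactAuthMethod enum."""

    NONE = "none"        # CA_REPOSITORY_ARN unset: plain pip install
    BOTO3 = "boto3"      # build an authenticated index URL via boto3
    AWS_CLI = "aws_cli"  # run `aws codeartifact login --tool pip`


def _has_boto3() -> bool:
    # True if boto3 is importable in the current interpreter.
    return importlib.util.find_spec("boto3") is not None


def _has_aws_cli() -> bool:
    # True if an `aws` executable is on PATH.
    return shutil.which("aws") is not None


def choose_auth_method(
    env: Optional[Mapping[str, str]] = None,
    has_boto3: Callable[[], bool] = _has_boto3,
    has_aws_cli: Callable[[], bool] = _has_aws_cli,
) -> AuthMethod:
    """Pick how to authenticate pip, following the order described in this PR."""
    env = os.environ if env is None else env
    if not env.get("CA_REPOSITORY_ARN"):
        return AuthMethod.NONE
    if has_boto3():
        return AuthMethod.BOTO3
    if has_aws_cli():
        return AuthMethod.AWS_CLI
    # The ARN is set but nothing can authenticate: fail loudly rather than
    # silently falling back to public PyPI.
    raise RuntimeError(
        "CA_REPOSITORY_ARN is set but neither boto3 nor the AWS CLI is available"
    )
```

Passing the environment and the availability checks as parameters is a design choice of this sketch so that each branch can be unit tested without a container, boto3, or the AWS CLI installed.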
Known risks
- Tuning jobs depend on the container's toolkit. The `CreateHyperParameterTuningJob` API uses `HyperParameterAlgorithmSpecification`, which lacks `ContainerEntrypoint`, so the Tuner cannot use `sm_train.sh`. If future containers drop `sagemaker-training-toolkit`, tuning jobs will lose CodeArtifact support with no SDK-side fix possible until the API adds entrypoint support.
- boto3 availability in future containers. Current PyTorch training containers (2.7–2.9) include boto3. New DLC base images on the `main` branch do not. The script's fallback to AWS CLI mitigates this, but if neither is available, the script hard-fails. The long-term solution is a shared package with boto3 as a declared dependency (see analysis).

Long-term solution
This PR is a stopgap that works within the SDK alone. The long-term solution requires coordination between the SDK and DLC to ensure that both the container's default entrypoint and any SDK-overridden entrypoint have access to the same CodeArtifact-aware installer — ideally a shared package with boto3 as a declared dependency, installed in all SageMaker containers. See the proposed ideal solution for details.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.