feat: add CodeArtifact support for ModelTrainer and FrameworkProcessor requirements.txt installation #5772

Open
humanzz wants to merge 1 commit into aws:master from humanzz:codeartifact-fix

humanzz (Contributor) commented Apr 17, 2026

SDK v3's ModelTrainer and FrameworkProcessor override the container entrypoint with SDK-generated scripts (sm_train.sh, runproc.sh). This bypasses the container's own entrypoint, where sagemaker-training-toolkit handled CA_REPOSITORY_ARN-based CodeArtifact authentication. As a result, CodeArtifact support broke for both training (Bug 4) and processing (Bug 3), as reported in #5765.

This is the stopgap solution proposed in this comment on #5765: a self-contained install_requirements.py script that the SDK uploads to the container alongside its generated entrypoint scripts.

  • Add install_requirements.py in sagemaker-core: it reads CA_REPOSITORY_ARN from the container environment and, if the variable is unset, skips CodeArtifact authentication and runs a plain pip install
  • Try boto3 first (matching sagemaker-training-toolkit), fall back to the AWS CLI, hard-fail if neither is available
  • Wire into ModelTrainer: copy script into sm_drivers/scripts/, update INSTALL_REQUIREMENTS templates to call it instead of bare pip install
  • Wire into FrameworkProcessor: upload script as sibling file alongside runproc.sh, update generated script to call it

Issue #, if available:

#5765

Description of changes:

When the SDK overrides a container's entrypoint — as ModelTrainer does for training jobs (Bug 4) and FrameworkProcessor does for processing jobs (Bug 3) — the container's native sagemaker-training-toolkit is bypassed. This toolkit handled CA_REPOSITORY_ARN-based CodeArtifact authentication for requirements.txt installation via boto3. Without it, pip install -r requirements.txt runs against public PyPI, failing in VPC-isolated environments or when packages are only available in a private CodeArtifact repository.

See #5765 and the detailed analysis comment for full context.

Solution: Stopgap install_requirements.py

A self-contained Python script in sagemaker-core that handles CodeArtifact authentication before installing requirements. It:

  1. Reads CA_REPOSITORY_ARN from the container environment — if not set, does a normal pip install
  2. Tries boto3 first (matching sagemaker-training-toolkit's approach) to build an authenticated pip index URL
  3. Falls back to AWS CLI (aws codeartifact login --tool pip) if boto3 is unavailable
  4. Hard-fails with a clear error if CA_REPOSITORY_ARN is set but neither boto3 nor AWS CLI is available
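
The four steps above amount to a small decision procedure. The following is a minimal sketch of that selection logic only, not the PR's actual code: the `AuthMethod` enum, its member names, and the functions `select_auth_method` and `detect_auth_method` are hypothetical stand-ins (the members of the PR's `CodeArtifactAuthMethod` are not shown in this description). Availability is passed in as booleans so the ordering can be checked without AWS access.

```python
import enum
import importlib.util
import os
import shutil


class AuthMethod(enum.Enum):
    """Hypothetical stand-in for the PR's CodeArtifactAuthMethod enum."""

    NONE = "none"        # CA_REPOSITORY_ARN unset: plain pip install
    BOTO3 = "boto3"      # preferred, mirrors sagemaker-training-toolkit
    AWS_CLI = "aws-cli"  # fallback: aws codeartifact login --tool pip


def select_auth_method(env, boto3_available, cli_available):
    """Pick how to authenticate pip, in the order described above."""
    if not env.get("CA_REPOSITORY_ARN"):
        return AuthMethod.NONE
    if boto3_available:
        return AuthMethod.BOTO3
    if cli_available:
        return AuthMethod.AWS_CLI
    # The ARN is set but nothing can authenticate: fail loudly rather than
    # silently installing from public PyPI.
    raise RuntimeError(
        "CA_REPOSITORY_ARN is set but neither boto3 nor the AWS CLI is available"
    )


def detect_auth_method():
    """Probe the running environment (assumed probes, not the PR's code)."""
    return select_auth_method(
        os.environ,
        boto3_available=importlib.util.find_spec("boto3") is not None,
        cli_available=shutil.which("aws") is not None,
    )
```

Raising in the last branch, instead of falling back to a bare pip install, reflects the hard-fail behaviour in step 4: a silent fallback would hide the misconfiguration until pip fails to reach public PyPI.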

The script can be used as:

  • A standalone script: python install_requirements.py requirements.txt (used by bash-based entrypoints)
  • An importable module: from sagemaker.core.utils.install_requirements import configure_pip, install_requirements (for Python-native callers like @remote or ModelBuilder)
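
To illustrate how one file can serve both uses, here is a minimal sketch under stated assumptions. The names `build_pip_command` and this `main` signature are hypothetical and are not taken from the PR; the only real interfaces used are pip's `-r` and `--index-url` options. CodeArtifact authentication is left out so the sketch stays self-contained.

```python
import subprocess
import sys


def build_pip_command(requirements_path, index_url=None):
    """Return the pip invocation as an argument list (hypothetical helper)."""
    cmd = [sys.executable, "-m", "pip", "install", "-r", requirements_path]
    if index_url:
        # An authenticated CodeArtifact index URL would be passed here.
        cmd += ["--index-url", index_url]
    return cmd


def main(argv=None):
    """Entry point for `python install_requirements.py requirements.txt`."""
    args = sys.argv[1:] if argv is None else argv
    if len(args) != 1:
        print("usage: install_requirements.py <requirements.txt>", file=sys.stderr)
        return 2
    # Propagate pip's exit code so a bash entrypoint can stop on failure.
    return subprocess.call(build_pip_command(args[0]))


# A standalone script would end with the usual guard:
#     if __name__ == "__main__":
#         sys.exit(main())
```

Python-native callers would import the helper functions directly, while bash-based entrypoints would go through `main` and rely on the exit code.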

Changes

| File | Change |
| --- | --- |
| sagemaker-core/.../utils/install_requirements.py | New module with configure_pip(), install_requirements(), main(), and CodeArtifactAuthMethod enum |
| sagemaker-core/tests/unit/test_install_requirements.py | 22 unit tests covering all auth methods, fallback chains, error propagation |
| sagemaker-train/.../templates.py | INSTALL_REQUIREMENTS and INSTALL_AUTO_REQUIREMENTS now call install_requirements.py instead of bare pip install |
| sagemaker-train/.../model_trainer.py | Copy install_requirements.py from sagemaker-core into sm_drivers/scripts/ at runtime |
| sagemaker-core/.../processing.py | Upload install_requirements.py as sibling file alongside runproc.sh and sourcedir.tar.gz; update generated script to call it |
| sagemaker-core/tests/unit/test_processing.py | Verify install_requirements.py is uploaded and referenced in generated script |

What this covers

| Job Type | Class | CodeArtifact with this PR |
| --- | --- | --- |
| Training | ModelTrainer | ✅ Fixed: install_requirements.py in sm_drivers/scripts/ |
| Processing | FrameworkProcessor | ✅ Fixed: install_requirements.py uploaded as sibling file |
| Tuning | Tuner | ✅ Already works: Tuner uses the container's native toolkit (not affected by this PR) |
| Inference | ModelBuilder | ✅ Already works: SDK doesn't override inference entrypoints |

What this does NOT cover

| Path | Status | Notes |
| --- | --- | --- |
| @remote function (runtime_environment_manager.py) | ❌ Not wired | Has its own _install_requirements_txt() that does bare pip install. Could use configure_pip() via import. |
| sagemaker-serve (requirements_manager.py) | ❌ Not wired | Same: bare pip install in-process. Could import configure_pip(). |
| sagemaker-core/modules (templates.py) | ❌ Not wired | Duplicate of sagemaker-train/templates.py without INSTALL_REQUIREMENTS. Lower priority. |

These are follow-up opportunities — the module is available for them to import.

Known risks

  1. Tuning jobs depend on the container's toolkit — The CreateHyperParameterTuningJob API uses HyperParameterAlgorithmSpecification which lacks ContainerEntrypoint, so the Tuner cannot use sm_train.sh. If future containers drop sagemaker-training-toolkit, tuning jobs will lose CodeArtifact support with no SDK-side fix possible until the API adds entrypoint support.

  2. boto3 availability in future containers — Current PyTorch training containers (2.7–2.9) include boto3. New DLC base images on the main branch do not. The script's fallback to AWS CLI mitigates this, but if neither is available, the script hard-fails. The long-term solution is a shared package with boto3 as a declared dependency (see analysis).

Long-term solution

This PR is a stopgap that works within the SDK alone. The long-term solution requires coordination between the SDK and DLC to ensure that both the container's default entrypoint and any SDK-overridden entrypoint have access to the same CodeArtifact-aware installer — ideally a shared package with boto3 as a declared dependency, installed in all SageMaker containers. See the proposed ideal solution for details.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


humanzz commented Apr 17, 2026

I've also tested this against my own code, as an integration test, to verify the behaviour.

Training Job (ModelTrainer)

from sagemaker.train import ModelTrainer
from sagemaker.core.training.configs import Compute, SourceCode

# sm_session and inputs are assumed to be defined earlier in the test script.
trainer = ModelTrainer(
    training_image="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.10-cpu-py313",
    role="arn:aws:iam::123456789:role/MyRole",
    source_code=SourceCode(
        entry_script="train.py", source_dir="src", requirements="requirements.txt"
    ),
    compute=Compute(instance_type="ml.m5.xlarge", instance_count=1),
    sagemaker_session=sm_session,
    # Setting CA_REPOSITORY_ARN is what triggers CodeArtifact authentication.
    environment={"CA_REPOSITORY_ARN": "arn:aws:codeartifact:us-west-2:ACCOUNT:repository/DOMAIN/REPO"},
)
trainer.train(input_data_config=inputs, wait=False)

CloudWatch logs: install_requirements.py ran from sm_drivers/scripts/, authenticated via boto3, and pip resolved packages from CodeArtifact:

Installing requirements
++ /usr/local/bin/python3 /opt/ml/input/data/sm_drivers/scripts/install_requirements.py requirements.txt
Looking in indexes: https://aws:****@amazon-ACCOUNT.d.codeartifact.us-west-2.amazonaws.com/pypi/REPO/simple/
  Downloading https://amazon-ACCOUNT.d.codeartifact.us-west-2.amazonaws.com/pypi/REPO/simple/pyarrow/20.0.0/pyarrow-20.0.0-cp313-cp313-manylinux_2_28_x86_64.whl (42.3 MB)
  Downloading https://amazon-ACCOUNT.d.codeartifact.us-west-2.amazonaws.com/pypi/REPO/simple/sentence-transformers/5.4.1/sentence_transformers-5.4.1-py3-none-any.whl (571 kB)

Training job completed successfully. ✅
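
The index URL in the log above follows the documented CodeArtifact pip endpoint shape, with the domain and account taken from CA_REPOSITORY_ARN. As a rough sketch of how such a URL could be derived (the function names `parse_codeartifact_arn` and `build_index_url` are hypothetical and not the PR's code; the token is passed in as a plain argument so the example runs without AWS access):

```python
def parse_codeartifact_arn(arn):
    """Split arn:aws:codeartifact:REGION:ACCOUNT:repository/DOMAIN/REPO."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[2] != "codeartifact":
        raise ValueError(f"not a CodeArtifact ARN: {arn!r}")
    resource = parts[5].split("/")
    if len(resource) != 3 or resource[0] != "repository":
        raise ValueError(f"not a CodeArtifact repository ARN: {arn!r}")
    return {
        "region": parts[3],
        "account": parts[4],
        "domain": resource[1],
        "repository": resource[2],
    }


def build_index_url(arn, token):
    """Build an authenticated pip index URL from the ARN and a token.

    In a real run the token would come from an authorization call, for
    example the boto3 codeartifact client's get_authorization_token
    (domain=..., domainOwner=...), reading "authorizationToken".
    """
    p = parse_codeartifact_arn(arn)
    host = f"{p['domain']}-{p['account']}.d.codeartifact.{p['region']}.amazonaws.com"
    return f"https://aws:{token}@{host}/pypi/{p['repository']}/simple/"
```

Rejecting a malformed ARN up front keeps the failure close to its cause, instead of surfacing later as an opaque pip resolution error.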

Processing Job (FrameworkProcessor)

from sagemaker.core.processing import FrameworkProcessor

# sm_session is assumed to be defined earlier in the test script.
processor = FrameworkProcessor(
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.10-gpu-py313",
    command=["python3"],
    role="arn:aws:iam::123456789:role/MyRole",
    instance_count=1,
    instance_type="ml.g6.4xlarge",
    sagemaker_session=sm_session,
    # Setting CA_REPOSITORY_ARN is what triggers CodeArtifact authentication.
    env={"CA_REPOSITORY_ARN": "arn:aws:codeartifact:us-west-2:ACCOUNT:repository/DOMAIN/REPO"},
)
processor.run(code="my_script.py", source_dir="src", wait=False)

CloudWatch logs: install_requirements.py was uploaded as a sibling file and authenticated via boto3:

Files in /opt/ml/processing/input/code/ before extraction:
-rw-r--r-- 1 root root  6652 Apr 17 10:22 install_requirements.py
-rw-r--r-- 1 root root   685 Apr 17 10:22 runproc.sh
-rw-r--r-- 1 root root 81582 Apr 17 10:22 sourcedir.tar.gz

Looking in indexes: https://aws:****@amazon-ACCOUNT.d.codeartifact.us-west-2.amazonaws.com/pypi/REPO/simple/
  Downloading https://amazon-ACCOUNT.d.codeartifact.us-west-2.amazonaws.com/pypi/REPO/simple/pyarrow/20.0.0/pyarrow-20.0.0-cp313-cp313-manylinux_2_28_x86_64.whl (42.3 MB)
  Downloading https://amazon-ACCOUNT.d.codeartifact.us-west-2.amazonaws.com/pypi/REPO/simple/sentence-transformers/5.4.1/sentence_transformers-5.4.1-py3-none-any.whl (571 kB)

Processing job completed successfully. ✅
