42 changes: 42 additions & 0 deletions .github/workflows/cd.yml
@@ -102,6 +102,48 @@ jobs:
with:
verbose: true

upload-gpu-test-asset:
name: Upload gpu_test binary to release
needs: [release-please, pypi-publish]
if: ${{ always() && (needs.release-please.outputs.release_created || (github.event_name == 'workflow_dispatch' && inputs.force_publish == 'true')) }}
runs-on: ubuntu-latest
permissions:
contents: write
steps:
- name: Checkout code
uses: actions/checkout@v5
with:
ref: ${{ needs.release-please.outputs.tag_name }}

- name: Login to Docker Hub (optional)
if: ${{ vars.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3
with:
username: ${{ vars.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}

- name: Compile gpu_test binary
run: |
cd build_tools
./compile_gpu_test.sh
cd ..
test -f runpod/serverless/binaries/gpu_test

- name: Generate sha256 checksum
working-directory: runpod/serverless/binaries
run: |
sha256sum gpu_test > gpu_test.sha256
cat gpu_test.sha256

- name: Upload binary to release
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
gh release upload "${{ needs.release-please.outputs.tag_name }}" \
runpod/serverless/binaries/gpu_test \
runpod/serverless/binaries/gpu_test.sha256 \
--clobber

# TODO: Re-enable after optimizing (17 parallel jobs each sleeping 5min is wasteful).
# Consider a single job that sleeps once then dispatches sequentially.
# notify-workers:
3 changes: 3 additions & 0 deletions .gitignore
@@ -142,3 +142,6 @@ runpod/_version.py

*.lock
benchmark_results/

# Locally-compiled CUDA test binary — CI compiles per-release
runpod/serverless/binaries/gpu_test
17 changes: 17 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,22 @@
# Changelog

## Unreleased

### Changed

- **gpu_test binary no longer bundled in the PyPI wheel.** Fixes installs on
Nix and other non-glibc platforms ([#498](https://github.com/runpod/runpod-python/issues/498)).
Runtime falls back to an `nvidia-smi`-based availability check when the
binary is missing. Runpod GPU workers should add
`RUN runpod install-gpu-test` after `pip install runpod` to restore the
native CUDA memory-allocation test.

### Added

- `runpod install-gpu-test` CLI command — downloads the `gpu_test` binary
from the GitHub release matching the installed runpod version, verifies
sha256, and installs it into the package's `serverless/binaries/` directory.

## [1.9.0](https://github.com/runpod/runpod-python/compare/v1.8.2...v1.9.0) (2026-04-08)


2 changes: 1 addition & 1 deletion MANIFEST.in
@@ -1,4 +1,4 @@
include runpod/serverless/binaries/gpu_test
include runpod/serverless/binaries/README.md
include build_tools/gpu_test.c
include build_tools/compile_gpu_test.sh
exclude runpod/serverless/binaries/gpu_test
19 changes: 18 additions & 1 deletion docs/serverless/gpu_binary_compilation.md
@@ -4,13 +4,30 @@ This document explains how to rebuild the `gpu_test` binary for GPU health check

## When to Rebuild

You typically **do not need to rebuild** the binary. A pre-compiled version is included in the runpod-python package and works across most GPU environments. Rebuild only when:
You typically **do not need to rebuild** the binary. A pre-compiled version is published as a GitHub release asset and can be installed with `runpod install-gpu-test` (see next section). Rebuild only when:

- You need to modify the GPU test logic (in `build_tools/gpu_test.c`)
- Targeting specific new CUDA versions
- Adding support for new GPU architectures
- Fixing compilation issues for your specific environment

## Installing from a release

As of v1.10.0, the `gpu_test` binary is **not bundled** in the PyPI wheel so the package stays platform-agnostic (fixes [#498](https://github.com/runpod/runpod-python/issues/498) — Nix / non-glibc builds).

Runpod GPU workers that want the native CUDA memory-allocation test back should run:

```bash
pip install runpod
runpod install-gpu-test
```

This downloads `gpu_test` from the GitHub release matching the installed runpod version, verifies its sha256, and places it at `runpod/serverless/binaries/gpu_test` inside the installed package.

If the binary is missing, the runtime falls back to an `nvidia-smi`-based availability check (no memory-allocation test).

Advanced users can override the binary path with the `RUNPOD_BINARY_GPU_TEST_PATH` environment variable.
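The lookup order above can be sketched as follows. This is an illustrative reconstruction, not runpod's actual code; the helper name `resolve_gpu_test` is invented for the example:

```python
import os
import shutil
from pathlib import Path


def resolve_gpu_test(package_dir: Path) -> str:
    """Sketch of the resolution order: env override, installed binary, fallback."""
    override = os.environ.get("RUNPOD_BINARY_GPU_TEST_PATH")
    if override and Path(override).is_file():
        return override  # explicit override wins
    installed = package_dir / "serverless" / "binaries" / "gpu_test"
    if installed.is_file():
        return str(installed)  # placed by `runpod install-gpu-test`
    if shutil.which("nvidia-smi"):
        return "nvidia-smi"  # degraded check: availability only, no memory-allocation test
    return ""  # no GPU tooling found; skip the check
```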

## Prerequisites

You need Docker installed to build the binary:
13 changes: 11 additions & 2 deletions docs/serverless/worker_fitness_checks.md
@@ -169,10 +169,19 @@ GPU workers automatically run a built-in fitness check that validates GPU memory
The check:
- Tests actual GPU memory allocation (cudaMalloc) to ensure GPUs are accessible
- Enumerates all detected GPUs and validates each one
- Uses a native CUDA binary for comprehensive testing
- Falls back to Python-based checks if the binary is unavailable
- Uses a native CUDA binary for comprehensive testing (opt-in; see below)
- Falls back to an `nvidia-smi` availability check if the binary is unavailable
- Skips silently on CPU-only workers (allows same code for CPU/GPU)

**Installing the native binary**: as of v1.10.0 the `gpu_test` binary is not
bundled in the PyPI wheel. Runpod GPU worker Dockerfiles should add:

```dockerfile
RUN pip install runpod && runpod install-gpu-test
```

See [GPU Binary Compilation](./gpu_binary_compilation.md) for details.

```python
import runpod

1 change: 0 additions & 1 deletion pyproject.toml
@@ -49,7 +49,6 @@ include-package-data = true

[tool.setuptools.package-data]
runpod = [
"serverless/binaries/gpu_test",
"serverless/binaries/README.md",
]

2 changes: 2 additions & 0 deletions runpod/cli/entry.py
@@ -8,6 +8,7 @@

from .groups.config.commands import config_wizard
from .groups.exec.commands import exec_cli
from .groups.install.commands import install_gpu_test_cli
from .groups.pod.commands import pod_cli
from .groups.project.commands import project_cli
from .groups.ssh.commands import ssh_cli
@@ -24,3 +25,4 @@ def runpod_cli():
runpod_cli.add_command(pod_cli) # runpod pod
runpod_cli.add_command(exec_cli) # runpod exec
runpod_cli.add_command(project_cli) # runpod project
runpod_cli.add_command(install_gpu_test_cli) # runpod install-gpu-test
1 change: 1 addition & 0 deletions runpod/cli/groups/install/__init__.py
@@ -0,0 +1 @@
"""GPU test binary installer CLI."""
65 changes: 65 additions & 0 deletions runpod/cli/groups/install/commands.py
@@ -0,0 +1,65 @@
"""
CLI commands for installing optional runpod binaries.
"""

from __future__ import annotations

import sys
from pathlib import Path

import click

import runpod
from runpod.version import get_version

from .functions import (
BinaryChecksumMismatch,
BinaryDownloadError,
download_gpu_test_binary,
)


def _default_install_path() -> Path:
"""Package-local binaries dir — the same path _binary_helpers checks."""
return Path(runpod.__file__).parent / "serverless" / "binaries" / "gpu_test"


@click.command(
"install-gpu-test",
help=(
"Download the optional gpu_test CUDA health-check binary from the "
"GitHub release matching the installed runpod version. "
"Runpod GPU workers only — no-op on CPU-only environments."
),
Comment on lines +29 to +33 (Copilot AI, Apr 17, 2026):
The command help text says this is a "no-op on CPU-only environments", but install_gpu_test_cli always attempts the download regardless of GPU availability. Either remove that claim from the help text or implement an explicit CPU-only short-circuit so behavior matches the CLI help.
)
@click.option(
"--version",
"version",
default=None,
help="Release tag to download (defaults to installed runpod version).",
)
@click.option(
"--dest",
"dest",
type=click.Path(dir_okay=False, writable=True, path_type=Path),
default=None,
help="Override destination path. Defaults to the package's binaries dir.",
)
def install_gpu_test_cli(version: str | None, dest: Path | None) -> None:
version = version or get_version()
if version == "unknown":
click.echo(
"Cannot determine installed runpod version; pass --version explicitly.",
err=True,
)
sys.exit(1)

target = dest or _default_install_path()

try:
installed_at = download_gpu_test_binary(version=version, dest=target)
except (BinaryDownloadError, BinaryChecksumMismatch) as exc:
click.echo(f"Failed to install gpu_test: {exc}", err=True)
sys.exit(1)

click.echo(f"Installed gpu_test at {installed_at}")
108 changes: 108 additions & 0 deletions runpod/cli/groups/install/functions.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
"""
Download and install the optional gpu_test binary from a GitHub release.

The binary is NOT bundled in PyPI wheels to keep them universal
(py3-none-any). Runpod GPU workers that want the native CUDA memory
allocation test can fetch it from the GitHub release matching their
installed runpod version.

See docs/serverless/gpu_binary_compilation.md for usage.
"""

from __future__ import annotations

import hashlib
import os
import tempfile
import urllib.error
import urllib.request
from dataclasses import dataclass
from pathlib import Path

GITHUB_REPO = "runpod/runpod-python"
DOWNLOAD_TIMEOUT_SECONDS = 60


@dataclass(frozen=True)
class ReleaseAssetUrls:
binary: str
checksum: str


class BinaryDownloadError(RuntimeError):
"""Raised when the binary or checksum cannot be fetched."""


class BinaryChecksumMismatch(RuntimeError):
"""Raised when the downloaded binary's sha256 does not match the expected value."""


def release_asset_urls(version: str) -> ReleaseAssetUrls:
"""Build release-asset URLs for a given runpod version.

Accepts either '1.9.0' or 'v1.9.0' — the leading 'v' is optional.
"""
clean = version.lstrip("v")
Comment (Copilot AI, Apr 17, 2026):
version.lstrip("v") strips all leading v characters (e.g., "vv1.2.3" becomes "1.2.3"), which is broader than intended. Prefer removing only a single leading "v" (e.g., removeprefix("v") or version[1:] if version.startswith("v") else version) to avoid surprising tag construction.

Suggested change:
-    clean = version.lstrip("v")
+    clean = version[1:] if version.startswith("v") else version
base = f"https://github.com/{GITHUB_REPO}/releases/download/v{clean}/gpu_test"
return ReleaseAssetUrls(binary=base, checksum=f"{base}.sha256")


def _fetch(url: str) -> bytes:
try:
with urllib.request.urlopen(url, timeout=DOWNLOAD_TIMEOUT_SECONDS) as response:
return response.read()
except urllib.error.HTTPError as exc:
raise BinaryDownloadError(
f"HTTP {exc.code} fetching {url}: {exc.reason}"
) from exc
except urllib.error.URLError as exc:
raise BinaryDownloadError(
f"Network error fetching {url}: {exc.reason!r}"
) from exc


def _parse_sha256(checksum_body: bytes) -> str:
"""Extract the hex digest from a 'sha256 filename' line."""
text = checksum_body.decode("utf-8", errors="replace").strip()
first_token = text.split()[0] if text else ""
if len(first_token) != 64:
raise BinaryDownloadError(
f"checksum file did not contain a sha256 digest: {text!r}"
)
return first_token.lower()
Comment on lines +64 to +72 (Copilot AI, Apr 17, 2026):
_parse_sha256() only checks token length, not that it is valid hex. A 64-character non-hex token will incorrectly pass parsing and then fail later as BinaryChecksumMismatch, which misclassifies the problem. Validate with a hex regex / string.hexdigits check and raise BinaryDownloadError when the checksum file content is malformed.
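A validation along those lines could look like this. It is an illustrative sketch: it raises a plain `ValueError` so the snippet stays self-contained, where the real code would raise `BinaryDownloadError`:

```python
import string


def parse_sha256(checksum_body: bytes) -> str:
    """Extract a sha256 hex digest, rejecting malformed checksum files."""
    text = checksum_body.decode("utf-8", errors="replace").strip()
    first_token = text.split()[0] if text else ""
    # Require 64 chars AND all hex: a non-hex 64-char token is a malformed
    # checksum file, not a checksum mismatch.
    if len(first_token) != 64 or not all(c in string.hexdigits for c in first_token):
        raise ValueError(f"checksum file did not contain a sha256 digest: {text!r}")
    return first_token.lower()
```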


def download_gpu_test_binary(version: str, dest: Path) -> Path:
"""Download gpu_test from the matching GitHub release and install it at dest.

Verifies sha256 before writing to the final destination. On checksum
mismatch or HTTP failure, no partial file is left at dest.

Returns the destination path on success.
"""
urls = release_asset_urls(version)

checksum_body = _fetch(urls.checksum)
expected_sha = _parse_sha256(checksum_body)

binary_body = _fetch(urls.binary)
actual_sha = hashlib.sha256(binary_body).hexdigest()
if actual_sha != expected_sha:
raise BinaryChecksumMismatch(
f"sha256 mismatch for {urls.binary} "
f"({len(binary_body)} bytes): "
f"expected {expected_sha}, got {actual_sha}"
)

dest.parent.mkdir(parents=True, exist_ok=True)
with tempfile.NamedTemporaryFile(dir=dest.parent, delete=False) as tmp:
tmp.write(binary_body)
tmp_path = Path(tmp.name)

try:
os.chmod(tmp_path, 0o750)

Check failure (Code scanning / CodeQL): Overly permissive file permissions (High). Overly permissive mask in chmod sets file to group readable.
os.replace(tmp_path, dest)
except OSError:
tmp_path.unlink(missing_ok=True)
raise
return dest
21 changes: 19 additions & 2 deletions runpod/serverless/binaries/README.md
@@ -4,7 +4,24 @@ Pre-compiled GPU health check binary for Linux x86_64.

## Files

- `gpu_test` - Compiled binary for CUDA GPU memory allocation testing
- `gpu_test` - Compiled binary for CUDA GPU memory allocation testing (not
bundled in the PyPI wheel; see below)

## Availability

As of runpod v1.10.0 this binary is **not included** in the PyPI wheel. The
universal `py3-none-any` wheel would otherwise advertise itself as
platform-agnostic while shipping a Linux x86_64 ELF, which breaks Nix and
other strict packagers (see [#498](https://github.com/runpod/runpod-python/issues/498)).

Runpod GPU workers can download the matching binary with:

```bash
runpod install-gpu-test
```

This fetches the asset from the GitHub release matching the installed runpod
version and verifies its sha256.
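The checksum asset uses the standard `sha256sum` format, so the verification can also be reproduced by hand. A minimal illustration, using a placeholder file in place of the real downloaded binary:

```bash
# Stand-in for the downloaded asset; on a worker this is the release binary.
printf 'placeholder' > gpu_test
# CI generates the companion checksum file like this...
sha256sum gpu_test > gpu_test.sha256
# ...and verification is a one-liner (prints "gpu_test: OK" on success).
sha256sum -c gpu_test.sha256
```

`runpod install-gpu-test` performs the equivalent comparison in Python rather than shelling out.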

## Compatibility

@@ -29,7 +46,7 @@ GPU 0 memory allocation test passed.

## Building

See `build_tools/compile_gpu_test.sh` and `docs/serverless/gpu_binary_compilation.md` for compilation instructions.
See `build_tools/compile_gpu_test.sh` and `docs/serverless/gpu_binary_compilation.md`.

## License

Binary file removed runpod/serverless/binaries/gpu_test
Binary file not shown.
1 change: 0 additions & 1 deletion setup.py
@@ -60,7 +60,6 @@
include_package_data=True,
package_data={
"runpod": [
"serverless/binaries/gpu_test",
"serverless/binaries/README.md",
]
},
1 change: 1 addition & 0 deletions tests/test_cli/test_install/__init__.py
@@ -0,0 +1 @@
