Skip to content

feat(sbom): use annotated data types in Pydantic config#1090

Open
smoparth wants to merge 1 commit intomainfrom
feat/sbom-annotated-types
Open

feat(sbom): use annotated data types in Pydantic config#1090
smoparth wants to merge 1 commit intomainfrom
feat/sbom-annotated-types

Conversation

@smoparth
Copy link
Copy Markdown
Contributor

@smoparth smoparth commented Apr 24, 2026

Replace generic str fields in SbomSettings and PurlConfig with specialized annotated types that validate input and make the schema self-documenting.

New annotated types:

  • PurlType: strips, lowercases, rejects empty purl type strings
  • RepositoryUrl: validates http/https URL with scheme and host
  • UpstreamPurl: validates purl strings via PackageURL.from_string()
  • SpdxNamespace: validates SPDX documentNamespace as a URL
  • SpdxActor: validates SPDX actor format (Organization/Person/Tool/NOASSERTION)

The @field_validator on PurlConfig.upstream is replaced by the UpstreamPurl annotated type, moving validation into the type itself.

Closes: #1072

Pull Request Description

@smoparth smoparth requested a review from a team as a code owner April 24, 2026 13:20
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7f689dab-ebeb-48e4-9335-2046be7b7dea

📥 Commits

Reviewing files that changed from the base of the PR and between a271eb0 and 523d4fd.

📒 Files selected for processing (5)
  • src/fromager/packagesettings/__init__.py
  • src/fromager/packagesettings/_models.py
  • src/fromager/packagesettings/_typedefs.py
  • tests/test_packagesettings.py
  • tests/test_sbom.py
✅ Files skipped from review due to trivial changes (2)
  • src/fromager/packagesettings/init.py
  • src/fromager/packagesettings/_typedefs.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_packagesettings.py

📝 Walkthrough

Walkthrough

This change introduces annotated Pydantic string typedefs in src/fromager/packagesettings/_typedefs.py: PurlType, RepositoryUrl, UpstreamPurl, SpdxNamespace, and SpdxActor, each with validation/normalization logic (HTTP URL validation, upstream PURL parsing, SPDX actor pattern, trimming/lowercasing). src/fromager/packagesettings/_models.py updates SbomSettings and PurlConfig fields to use these types and removes the prior upstream field validator. src/fromager/packagesettings/__init__.py re-exports the new types. Tests were added/updated to exercise the new typedef validations and adjusted SBOM expectations for repository URL usage.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(sbom): use annotated data types in Pydantic config' accurately describes the main change: introducing specialized annotated types to replace generic string fields in SBOM configuration.
Description check ✅ Passed The description clearly explains the motivation (schema self-documentation), lists all new annotated types with their validation behavior, and references the linked issue #1072.
Linked Issues check ✅ Passed The PR fully addresses #1072 by replacing generic string fields with specialized annotated types (PurlType, RepositoryUrl, UpstreamPurl, SpdxNamespace, SpdxActor) that embed validation and make the schema self-documenting.
Out of Scope Changes check ✅ Passed All changes directly support the stated objectives: new typedefs with validators, updated models to use them, updated tests for validation behavior, and test data adjustments for the new repository_url qualifier.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@mergify mergify Bot added the ci label Apr 24, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/test_packagesettings.py (1)

498-569: Solid coverage; one optional edge case

Tests follow the existing test_type_* pattern and exercise normalization + rejection for each typedef. Consider adding one case for SpdxActor rejecting an unknown category (e.g. "Robot: Bender") to lock in the regex's closed set of Organization|Person|Tool:

Optional addition
     with pytest.raises(ValueError):
         ta.validate_python("Organization: ")
+    with pytest.raises(ValueError):
+        ta.validate_python("Robot: Bender")
     with pytest.raises(ValueError):
         ta.validate_python("")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_packagesettings.py` around lines 498 - 569, Add an extra assertion
to test_type_spdx_actor to assert that an unknown SPDX actor category is
rejected: use the existing pydantic.TypeAdapter(SpdxActor) in
test_type_spdx_actor and add a with pytest.raises(ValueError):
ta.validate_python("Robot: Bender") to ensure the SpdxActor regex only allows
the closed set Organization|Person|Tool (reference SpdxActor and
test_type_spdx_actor).
src/fromager/packagesettings/_typedefs.py (2)

104-104: re.DOTALL allows newlines inside actor names

With re.DOTALL, \S.*$ will happily match "Organization: John\nDoe". SPDX actor names shouldn't span lines. The flag looks unnecessary here — dropping it tightens validation with no downside.

Proposed tweak
-_SPDX_ACTOR_RE = re.compile(r"^(Organization|Person|Tool):\s+\S.*$", flags=re.DOTALL)
+_SPDX_ACTOR_RE = re.compile(r"^(Organization|Person|Tool):\s+\S.*$")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fromager/packagesettings/_typedefs.py` at line 104, The _SPDX_ACTOR_RE
currently uses re.DOTALL which lets \S.* match across newlines (allowing
multi-line actor names); remove the flags=re.DOTALL from the re.compile call
(i.e., compile without DOTALL) so that \S and . do not match newlines and actor
names are constrained to a single line; update the _SPDX_ACTOR_RE definition
accordingly.

107-117: Consider Pydantic's AnyHttpUrl for stricter URL validation, but current implementation is sufficient.

The custom _validate_url works well for your use case (http/https, non-empty host), as confirmed by tests and SBOM generation. Switching to Pydantic's AnyHttpUrl would catch edge cases like https://@ but isn't necessary unless stricter validation becomes a requirement. If you do migrate, use Annotated[AnyHttpUrl, BeforeValidator(...)] returning the Url object, or Annotated[AnyHttpUrl, AfterValidator(str)] to keep plain strings (the latter applies normalization first, then converts to string).

Also applies to: 138-141, 164-167

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fromager/packagesettings/_typedefs.py` around lines 107 - 117, The custom
_validate_url function enforces http/https and non-empty host but could be
replaced with Pydantic's AnyHttpUrl for stricter validation; if you choose to
migrate, update type annotations where _validate_url is used to
Annotated[AnyHttpUrl, BeforeValidator(<normalizer>)] to return a Url object or
Annotated[AnyHttpUrl, AfterValidator(str)] to keep plain strings, and remove or
delegate logic from _validate_url accordingly (see references to _validate_url
and the alternative BeforeValidator/AfterValidator patterns); otherwise leave
_validate_url as-is since current tests and SBOM are sufficient.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/fromager/packagesettings/_models.py`:
- Around line 53-68: Update the release notes to document that the SPDX-related
fields in PackageSettings (supplier, creators, namespace, repository_url) now
enforce SPDX 2.3 via strict validators and will raise ValidationError for
non-compliant values; explicitly list the breaking examples given (supplier
"John Smith" must be prefixed with Organization:/Person:/Tool: or be
NOASSERTION, repository_url must include https://, namespace must include
https://) so downstream users know configs may fail to load after the upgrade.

---

Nitpick comments:
In `@src/fromager/packagesettings/_typedefs.py`:
- Line 104: The _SPDX_ACTOR_RE currently uses re.DOTALL which lets \S.* match
across newlines (allowing multi-line actor names); remove the flags=re.DOTALL
from the re.compile call (i.e., compile without DOTALL) so that \S and . do not
match newlines and actor names are constrained to a single line; update the
_SPDX_ACTOR_RE definition accordingly.
- Around line 107-117: The custom _validate_url function enforces http/https and
non-empty host but could be replaced with Pydantic's AnyHttpUrl for stricter
validation; if you choose to migrate, update type annotations where
_validate_url is used to Annotated[AnyHttpUrl, BeforeValidator(<normalizer>)] to
return a Url object or Annotated[AnyHttpUrl, AfterValidator(str)] to keep plain
strings, and remove or delegate logic from _validate_url accordingly (see
references to _validate_url and the alternative BeforeValidator/AfterValidator
patterns); otherwise leave _validate_url as-is since current tests and SBOM are
sufficient.

In `@tests/test_packagesettings.py`:
- Around line 498-569: Add an extra assertion to test_type_spdx_actor to assert
that an unknown SPDX actor category is rejected: use the existing
pydantic.TypeAdapter(SpdxActor) in test_type_spdx_actor and add a with
pytest.raises(ValueError): ta.validate_python("Robot: Bender") to ensure the
SpdxActor regex only allows the closed set Organization|Person|Tool (reference
SpdxActor and test_type_spdx_actor).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bfa425fc-a49f-4523-ac71-4e4b267429d2

📥 Commits

Reviewing files that changed from the base of the PR and between 05c65ab and a271eb0.

📒 Files selected for processing (4)
  • src/fromager/packagesettings/__init__.py
  • src/fromager/packagesettings/_models.py
  • src/fromager/packagesettings/_typedefs.py
  • tests/test_packagesettings.py

Comment thread src/fromager/packagesettings/_models.py
@smoparth smoparth requested a review from mprpic April 24, 2026 13:29
Copy link
Copy Markdown
Contributor

@mprpic mprpic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine overall, but I do have to question the level of validation we're going into. If this is the intent and other parts of the Fromager code base expect this type of detailed validation, then fine with me.

GlobalChangelog = Mapping[Variant, list[str]]
VariantChangelog = Mapping[PackageVersion, list[str]]

_SPDX_ACTOR_RE = re.compile(r"^(Organization|Person|Tool):\s+\S.*$", flags=re.DOTALL)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The official SPDX 2.3 schema lists these as examples but doesn't actually enforce it at the JSON schema level, it only expects a string:

https://github.com/spdx/tools-java/blob/master/resources/spdx-schema-v2.3.json#L58

So I think we're being overly defensive here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on project standards to validate the user inputs I added those checks but I'm open to adjusting the strictness, happy to widen or tighten the checks based on what is preferred.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would relax it to any string, not strict to Org/Person/Tool. Follow the SPDX2.3 JSON schema, if that allows any string, so should we.

Comment thread src/fromager/packagesettings/_typedefs.py Outdated
Comment thread src/fromager/packagesettings/_typedefs.py Outdated
Comment thread src/fromager/packagesettings/_typedefs.py
Copy link
Copy Markdown
Collaborator

@tiran tiran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smoparth You can get rid of most helper functions and type checks by using Pydantic's built-in constraints, dedicated data types, and after validator instead of before validator. Pydantic has great documentation that explains its validator concept very well.

@mprpic The checks are going to detect misconfiguration at startup, e.g. in fromager lint.

Comment on lines +107 to +117
def _validate_url(v: str) -> str:
"""Validate that *v* has an ``http`` or ``https`` scheme and a host."""
if not isinstance(v, str):
raise TypeError(f"expected str, got {type(v)}: {v!r}")
v = v.strip()
parsed = urlparse(v)
if parsed.scheme not in ("http", "https"):
raise ValueError(f"URL must use http or https scheme, got {v!r}")
if not parsed.netloc:
raise ValueError(f"URL is missing a host/netloc component: {v!r}")
return v
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pydantic has a dedicated validator and constraint for network types.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tiran ! I used Annotated[AnyHttpUrl, AfterValidator(str)] initially, it works at runtime since AfterValidator(str) converts the Url object back to a plain string.
However, mypy sees the static type as AnyHttpUrl, not str, which causes type errors on default values (namespace: SpdxNamespace = "https://...") and dict assignments in sbom.py requiring type: ignore comments across multiple files.
To avoid that, I found a work-around using TypeAdapter, which keeps the declared type as str. Pushed in the latest commit. Let me know if there's a better approach.

Comment thread src/fromager/packagesettings/_typedefs.py Outdated
Comment thread src/fromager/packagesettings/_typedefs.py Outdated
Replace generic `str` fields in SbomSettings and PurlConfig with
specialized annotated types that validate input and make the schema
self-documenting.

New annotated types:
- PurlType: strips, lowercases, rejects empty purl type strings using StringConstraints
- RepositoryUrl: AnyHttpUrl via TypeAdapter for http/https validation
- UpstreamPurl: validates purl strings via PackageURL.from_string()
- SpdxNamespace: validates SPDX documentNamespace as a URL
- SpdxActor: validates SPDX actor format (Organization/Person/Tool/NOASSERTION)

The @field_validator on PurlConfig.upstream is replaced by the
UpstreamPurl annotated type, moving validation into the type itself.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Closes: #1072
Signed-off-by: Shanmukh Pawan <smoparth@redhat.com>
@smoparth smoparth force-pushed the feat/sbom-annotated-types branch from a271eb0 to 523d4fd Compare April 27, 2026 13:37
@smoparth smoparth requested review from mprpic and tiran April 27, 2026 13:51
"""Default purl type for all packages (e.g. ``pypi``, ``generic``)"""

repository_url: str | None = None
repository_url: RepositoryUrl | None = None
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it makse sense to use AnyHttpUrl instead of defining a custom type RepositoryUrl?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the challenge in #1090 (comment).

The issue is that AnyHttpUrl resolves to a pydantic.Url object, not str. The consumer of this model (sbom.py) uses these values in f-strings (f"{namespace}/{name}-{version}.spdx.json") and assigns them into dict[str, str] (qualifiers["repository_url"] = repo_url). With AnyHttpUrl as the field type, mypy flags all of these as type errors since it expects str, not Url. We'd either need to update sbom.py to call str() on every access (which expands the scope of this PR), or add # type: ignore comments across multiple files.

The RepositoryUrl / SpdxNamespace wrappers exist purely to keep the field type as str while delegating validation to AnyHttpUrl internally. If you'd prefer, I can switch the fields to AnyHttpUrl directly and update sbom.py to handle the Url type -- happy to go either way.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to use AnyUrl and convert it to str in a few places. @mprpic what would you prefer?

If we decide to use a custom validation type, then we don't need need separate RepositoryUrl and SpdxNamespace types. A single StringUrl type would do.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AnyUrl sounds good to me!

"""SPDX supplier field for the wheel package (e.g. ``Organization: ExampleCo``)"""

namespace: str = "https://spdx.org/spdxdocs"
namespace: SpdxNamespace = "https://spdx.org/spdxdocs"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, why not use AnyHttpUrl type?


SpdxActor = typing.Annotated[
str,
StringConstraints(strip_whitespace=True, min_length=1),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is StringConstraints necessary? _validate_spdx_actor will fail with a better error message when the input is misformed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I wil remove it.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to export these types? An exported symbol becomes part of the stable API.

"""Default purl type for all packages (e.g. ``pypi``, ``generic``)"""

repository_url: str | None = None
repository_url: RepositoryUrl | None = None
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to use AnyUrl and convert it to str in a few places. @mprpic what would you prefer?

If we decide to use a custom validation type, then we don't need need separate RepositoryUrl and SpdxNamespace types. A single StringUrl type would do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SBOM: use annotated data types in Pydantic config

3 participants