feat(sbom): use annotated data types in Pydantic config#1090
feat(sbom): use annotated data types in Pydantic config#1090
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
✅ Files skipped from review due to trivial changes (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis change introduces annotated Pydantic string typedefs in Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/test_packagesettings.py (1)
498-569: Solid coverage; one optional edge caseTests follow the existing
test_type_*pattern and exercise normalization + rejection for each typedef. Consider adding one case forSpdxActorrejecting an unknown category (e.g."Robot: Bender") to lock in the regex's closed set ofOrganization|Person|Tool:Optional addition
with pytest.raises(ValueError): ta.validate_python("Organization: ") + with pytest.raises(ValueError): + ta.validate_python("Robot: Bender") with pytest.raises(ValueError): ta.validate_python("")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_packagesettings.py` around lines 498 - 569, Add an extra assertion to test_type_spdx_actor to assert that an unknown SPDX actor category is rejected: use the existing pydantic.TypeAdapter(SpdxActor) in test_type_spdx_actor and add a with pytest.raises(ValueError): ta.validate_python("Robot: Bender") to ensure the SpdxActor regex only allows the closed set Organization|Person|Tool (reference SpdxActor and test_type_spdx_actor).src/fromager/packagesettings/_typedefs.py (2)
104-104:re.DOTALLallows newlines inside actor namesWith
re.DOTALL,\S.*$will happily match"Organization: John\nDoe". SPDX actor names shouldn't span lines. The flag looks unnecessary here — dropping it tightens validation with no downside.Proposed tweak
-_SPDX_ACTOR_RE = re.compile(r"^(Organization|Person|Tool):\s+\S.*$", flags=re.DOTALL) +_SPDX_ACTOR_RE = re.compile(r"^(Organization|Person|Tool):\s+\S.*$")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/fromager/packagesettings/_typedefs.py` at line 104, The _SPDX_ACTOR_RE currently uses re.DOTALL which lets \S.* match across newlines (allowing multi-line actor names); remove the flags=re.DOTALL from the re.compile call (i.e., compile without DOTALL) so that \S and . do not match newlines and actor names are constrained to a single line; update the _SPDX_ACTOR_RE definition accordingly.
107-117: Consider Pydantic'sAnyHttpUrlfor stricter URL validation, but current implementation is sufficient.The custom
_validate_urlworks well for your use case (http/https, non-empty host), as confirmed by tests and SBOM generation. Switching to Pydantic'sAnyHttpUrlwould catch edge cases likehttps://@but isn't necessary unless stricter validation becomes a requirement. If you do migrate, useAnnotated[AnyHttpUrl, BeforeValidator(...)]returning the Url object, orAnnotated[AnyHttpUrl, AfterValidator(str)]to keep plain strings (the latter applies normalization first, then converts to string).Also applies to: 138-141, 164-167
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/fromager/packagesettings/_typedefs.py` around lines 107 - 117, The custom _validate_url function enforces http/https and non-empty host but could be replaced with Pydantic's AnyHttpUrl for stricter validation; if you choose to migrate, update type annotations where _validate_url is used to Annotated[AnyHttpUrl, BeforeValidator(<normalizer>)] to return a Url object or Annotated[AnyHttpUrl, AfterValidator(str)] to keep plain strings, and remove or delegate logic from _validate_url accordingly (see references to _validate_url and the alternative BeforeValidator/AfterValidator patterns); otherwise leave _validate_url as-is since current tests and SBOM are sufficient.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/fromager/packagesettings/_models.py`:
- Around line 53-68: Update the release notes to document that the SPDX-related
fields in PackageSettings (supplier, creators, namespace, repository_url) now
enforce SPDX 2.3 via strict validators and will raise ValidationError for
non-compliant values; explicitly list the breaking examples given (supplier
"John Smith" must be prefixed with Organization:/Person:/Tool: or be
NOASSERTION, repository_url must include https://, namespace must include
https://) so downstream users know configs may fail to load after the upgrade.
---
Nitpick comments:
In `@src/fromager/packagesettings/_typedefs.py`:
- Line 104: The _SPDX_ACTOR_RE currently uses re.DOTALL which lets \S.* match
across newlines (allowing multi-line actor names); remove the flags=re.DOTALL
from the re.compile call (i.e., compile without DOTALL) so that \S and . do not
match newlines and actor names are constrained to a single line; update the
_SPDX_ACTOR_RE definition accordingly.
- Around line 107-117: The custom _validate_url function enforces http/https and
non-empty host but could be replaced with Pydantic's AnyHttpUrl for stricter
validation; if you choose to migrate, update type annotations where
_validate_url is used to Annotated[AnyHttpUrl, BeforeValidator(<normalizer>)] to
return a Url object or Annotated[AnyHttpUrl, AfterValidator(str)] to keep plain
strings, and remove or delegate logic from _validate_url accordingly (see
references to _validate_url and the alternative BeforeValidator/AfterValidator
patterns); otherwise leave _validate_url as-is since current tests and SBOM are
sufficient.
In `@tests/test_packagesettings.py`:
- Around line 498-569: Add an extra assertion to test_type_spdx_actor to assert
that an unknown SPDX actor category is rejected: use the existing
pydantic.TypeAdapter(SpdxActor) in test_type_spdx_actor and add a with
pytest.raises(ValueError): ta.validate_python("Robot: Bender") to ensure the
SpdxActor regex only allows the closed set Organization|Person|Tool (reference
SpdxActor and test_type_spdx_actor).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bfa425fc-a49f-4523-ac71-4e4b267429d2
📒 Files selected for processing (4)
src/fromager/packagesettings/__init__.pysrc/fromager/packagesettings/_models.pysrc/fromager/packagesettings/_typedefs.pytests/test_packagesettings.py
mprpic
left a comment
There was a problem hiding this comment.
This looks fine overall, but I do have to question the level of validation we're going into. If this is the intent and other parts of the Fromager code base expect this type of detailed validation, then fine with me.
| GlobalChangelog = Mapping[Variant, list[str]] | ||
| VariantChangelog = Mapping[PackageVersion, list[str]] | ||
|
|
||
| _SPDX_ACTOR_RE = re.compile(r"^(Organization|Person|Tool):\s+\S.*$", flags=re.DOTALL) |
There was a problem hiding this comment.
The official SPDX 2.3 schema lists these as examples but doesn't actually enforce it at the JSON schema level, it only expects a string:
https://github.com/spdx/tools-java/blob/master/resources/spdx-schema-v2.3.json#L58
So I think we're being overly defensive here.
There was a problem hiding this comment.
Based on project standards to validate the user inputs I added those checks but I'm open to adjusting the strictness, happy to widen or tighten the checks based on what is preferred.
There was a problem hiding this comment.
I would relax it to any string, not strict to Org/Person/Tool. Follow the SPDX2.3 JSON schema, if that allows any string, so should we.
tiran
left a comment
There was a problem hiding this comment.
@smoparth You can get rid of most helper functions and type checks by using Pydantic's built-in constraints, dedicated data types, and after validator instead of before validator. Pydantic has great documentation that explains its validator concept very well.
@mprpic The checks are going to detect misconfiguration at startup, e.g. in fromager lint.
| def _validate_url(v: str) -> str: | ||
| """Validate that *v* has an ``http`` or ``https`` scheme and a host.""" | ||
| if not isinstance(v, str): | ||
| raise TypeError(f"expected str, got {type(v)}: {v!r}") | ||
| v = v.strip() | ||
| parsed = urlparse(v) | ||
| if parsed.scheme not in ("http", "https"): | ||
| raise ValueError(f"URL must use http or https scheme, got {v!r}") | ||
| if not parsed.netloc: | ||
| raise ValueError(f"URL is missing a host/netloc component: {v!r}") | ||
| return v |
There was a problem hiding this comment.
Pydantic has a dedicated validator and constraint for network types.
There was a problem hiding this comment.
Thanks @tiran ! I used Annotated[AnyHttpUrl, AfterValidator(str)] initially, it works at runtime since AfterValidator(str) converts the Url object back to a plain string.
However, mypy sees the static type as AnyHttpUrl, not str, which causes type errors on default values (namespace: SpdxNamespace = "https://...") and dict assignments in sbom.py requiring type: ignore comments across multiple files.
To avoid that, I found a work-around using TypeAdapter, which keeps the declared type as str. Pushed in the latest commit. Let me know if there's a better approach.
Replace generic `str` fields in SbomSettings and PurlConfig with specialized annotated types that validate input and make the schema self-documenting. New annotated types: - PurlType: strips, lowercases, rejects empty purl type strings using StringConstraints - RepositoryUrl: AnyHttpUrl via TypeAdapter for http/https validation - UpstreamPurl: validates purl strings via PackageURL.from_string() - SpdxNamespace: validates SPDX documentNamespace as a URL - SpdxActor: validates SPDX actor format (Organization/Person/Tool/NOASSERTION) The @field_validator on PurlConfig.upstream is replaced by the UpstreamPurl annotated type, moving validation into the type itself. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Closes: #1072 Signed-off-by: Shanmukh Pawan <smoparth@redhat.com>
a271eb0 to
523d4fd
Compare
| """Default purl type for all packages (e.g. ``pypi``, ``generic``)""" | ||
|
|
||
| repository_url: str | None = None | ||
| repository_url: RepositoryUrl | None = None |
There was a problem hiding this comment.
Would it makse sense to use AnyHttpUrl instead of defining a custom type RepositoryUrl?
There was a problem hiding this comment.
I added the challenge in #1090 (comment).
The issue is that AnyHttpUrl resolves to a pydantic.Url object, not str. The consumer of this model (sbom.py) uses these values in f-strings (f"{namespace}/{name}-{version}.spdx.json") and assigns them into dict[str, str] (qualifiers["repository_url"] = repo_url). With AnyHttpUrl as the field type, mypy flags all of these as type errors since it expects str, not Url. We'd either need to update sbom.py to call str() on every access (which expands the scope of this PR), or add # type: ignore comments across multiple files.
The RepositoryUrl / SpdxNamespace wrappers exist purely to keep the field type as str while delegating validation to AnyHttpUrl internally. If you'd prefer, I can switch the fields to AnyHttpUrl directly and update sbom.py to handle the Url type -- happy to go either way.
There was a problem hiding this comment.
I would prefer to use AnyUrl and convert it to str in a few places. @mprpic what would you prefer?
If we decide to use a custom validation type, then we don't need need separate RepositoryUrl and SpdxNamespace types. A single StringUrl type would do.
| """SPDX supplier field for the wheel package (e.g. ``Organization: ExampleCo``)""" | ||
|
|
||
| namespace: str = "https://spdx.org/spdxdocs" | ||
| namespace: SpdxNamespace = "https://spdx.org/spdxdocs" |
There was a problem hiding this comment.
Same here, why not use AnyHttpUrl type?
|
|
||
| SpdxActor = typing.Annotated[ | ||
| str, | ||
| StringConstraints(strip_whitespace=True, min_length=1), |
There was a problem hiding this comment.
Is StringConstraints necessary? _validate_spdx_actor will fail with a better error message when the input is misformed.
There was a problem hiding this comment.
No, I wil remove it.
There was a problem hiding this comment.
Do we need to export these types? An exported symbol becomes part of the stable API.
| """Default purl type for all packages (e.g. ``pypi``, ``generic``)""" | ||
|
|
||
| repository_url: str | None = None | ||
| repository_url: RepositoryUrl | None = None |
There was a problem hiding this comment.
I would prefer to use AnyUrl and convert it to str in a few places. @mprpic what would you prefer?
If we decide to use a custom validation type, then we don't need need separate RepositoryUrl and SpdxNamespace types. A single StringUrl type would do.
Replace generic
strfields in SbomSettings and PurlConfig with specialized annotated types that validate input and make the schema self-documenting.New annotated types:
The @field_validator on PurlConfig.upstream is replaced by the UpstreamPurl annotated type, moving validation into the type itself.
Closes: #1072
Pull Request Description