
feat: Support JSON Schema uploads and update related deployment scripts #566

Open
Prajwal-Microsoft wants to merge 5 commits into main from feature/json-schema-support

Conversation

@Prajwal-Microsoft
Collaborator

Purpose

This pull request adds support for authoring and managing schemas as JSON Schema documents in addition to the legacy Python (.py) format. It improves security by allowing a schema to be uploaded as pure data (JSON), which is parsed and materialized in memory without executing any uploaded code. The documentation, deployment scripts, and backend logic have been updated to support this new recommended workflow while maintaining backward compatibility with existing Python-based schemas.

Schema authoring and documentation improvements:

  • Updated docs/CustomizeSchemaData.md to recommend JSON Schema as the preferred format, explain its advantages, and document the new workflow for converting and uploading schemas. Added a comparison table, migration instructions, and API usage details.
  • Added a new script scripts/py_schema_to_json.py to convert legacy Pydantic .py schemas into JSON Schema documents for safer uploads.

Deployment and registration workflow:

  • Modified deployment scripts (post_deployment.ps1 and post_deployment.sh) to detect schema file extensions and set the appropriate MIME type when uploading schemas (application/json for .json, text/x-python for .py). Unsupported extensions are skipped.
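
As a rough illustration of the extension check described above, here is a minimal Python sketch. It is not the code from post_deployment.sh or post_deployment.ps1; the names `SCHEMA_MIME_TYPES` and `schema_content_type` are hypothetical, and the case-insensitive match is an assumption.

```python
from pathlib import Path
from typing import Optional

# Extension-to-MIME mapping stated in the PR description.
# Any other extension is treated as unsupported and skipped.
SCHEMA_MIME_TYPES = {
    ".json": "application/json",
    ".py": "text/x-python",
}


def schema_content_type(filename: str) -> Optional[str]:
    """Return the multipart Content-Type for a schema file, or None to skip it."""
    # Lower-casing the suffix is an assumption of this sketch.
    return SCHEMA_MIME_TYPES.get(Path(filename).suffix.lower())


for name in ("autoclaim.json", "policereport.py", "notes.txt"):
    mime = schema_content_type(name)
    print(name, "->", mime if mime else "skipped")
```

Keeping the mapping in one table means the shell, PowerShell, and Python helpers can be checked against a single list of supported extensions.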

Backend and data model changes:

  • Extended the Schema model in src/libs/pipeline/entities/schema.py to include a new Format field, distinguishing between "python" (legacy) and "json" (recommended) schema formats.
  • Updated the map handler (map_handler.py) to load and materialize schemas differently depending on their format: JSON schemas are parsed as data, while Python schemas are imported and executed as before.

Dependencies:

  • Added jsonschema to requirements.txt to support JSON Schema validation and handling.

Does this introduce a breaking change?

  • Yes
  • No

Golden Path Validation

  • I have tested the primary workflows (the "golden path") to ensure they function correctly without errors.

Deployment Validation

  • I have validated the deployment process successfully and all services are running as expected with this change.

What to Check

Verify that the following are valid

  • ...

Other Information

JSON Schema Migration added 3 commits April 28, 2026 10:05
Adds a parallel JSON Schema upload path so schemas can be authored as data instead of executable Python. The worker materialises Pydantic models from JSON in memory (no exec) via the new remote_schema_loader. Legacy .py uploads continue to work unchanged. M1 of the migration plan.
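
To make "materialise in memory, no exec" concrete, here is a minimal sketch of the idea. It is not the remote_schema_loader from this PR: it swaps in the standard library's `dataclasses.make_dataclass` as a stand-in for Pydantic model creation, handles only flat objects with primitive types, and uses a hypothetical helper name (`build_model`) and hypothetical sample fields.

```python
import json
from dataclasses import field, make_dataclass
from typing import Any, Optional

# Only these JSON Schema primitive types are handled in this sketch.
_PRIMITIVES = {"string": str, "integer": int, "number": float, "boolean": bool}


def build_model(document: dict, class_name: str) -> type:
    """Build a class from a flat JSON Schema object without exec or import."""
    required = set(document.get("required", []))
    required_fields, optional_fields = [], []
    for name, spec in document.get("properties", {}).items():
        py_type = _PRIMITIVES.get(spec.get("type"), Any)
        if name in required:
            required_fields.append((name, py_type))
        else:
            # Optional properties default to None.
            optional_fields.append((name, Optional[py_type], field(default=None)))
    # Fields without defaults must come before fields with defaults.
    return make_dataclass(class_name, required_fields + optional_fields)


# Hypothetical schema document, parsed purely as data.
raw = """
{
  "title": "PoliceReport",
  "type": "object",
  "properties": {
    "report_id": {"type": "string"},
    "officer": {"type": "string"}
  },
  "required": ["report_id"]
}
"""
PoliceReport = build_model(json.loads(raw), "PoliceReport")
report = PoliceReport(report_id="R-1")
```

The uploaded bytes only ever pass through `json.loads`, so nothing in the document can run as code; the class is assembled from a fixed set of known types.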
…_schema.py supports .json

- Adds damagedcarimage.json, policereport.json, repairestimate.json (generated via scripts/py_schema_to_json.py).
- register_schema.py now picks the correct content-type per extension (.py -> text/x-python, .json -> application/json).
- Manifest unchanged for now; flip to .json files when ready to deprecate the legacy Python path.
…json schemas

- schema_info.json manifest now lists *.json files (was *.py).
- post_deployment.sh and post_deployment.ps1 derive multipart Content-Type per file extension (.json -> application/json, .py -> text/x-python).
- test_http/schema_API.http examples updated to upload .json samples.
- docs/CustomizeSchemaData.md sample table, mermaid diagram, and manifest example refer to .json files.
- register_schema.py docstring example updated.

Legacy .py uploads still work end-to-end; the change just flips the default authored format.
Comment thread src/ContentProcessor/src/libs/utils/remote_schema_loader.py Fixed
Contributor

Copilot AI left a comment


Pull request overview

This PR adds first-class support for registering and using JSON Schema (.json) schema artifacts alongside legacy executable Python (.py) schemas, updating the API, worker, deployment tooling, samples, and docs to enable a safer “data-only” schema workflow.

Changes:

  • Add JSON Schema upload validation + class-name derivation in the Schema Vault API, and persist a new Format field (python | json) in schema metadata.
  • Update the worker map handler to materialize JSON Schema into in-memory Pydantic models (no code execution) while preserving legacy Python loading.
  • Refresh deployment/scripts/docs/samples/tests to use .json schemas by default and to upload the correct MIME types.
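
One plausible shape for the class-name derivation mentioned above is sketched below. This is an assumption, not the PR's `derive_class_name`: the function name `sketch_derive_class_name`, the use of the schema's `title`, and the `Schema` prefix rule are all hypothetical.

```python
import keyword
import re


def sketch_derive_class_name(document: dict, fallback: str) -> str:
    """Pick a Python-identifier-safe class name from a JSON Schema document."""
    # Assumption: prefer the document's "title", else the caller's fallback.
    candidate = document.get("title") or fallback
    # Split on anything that is not a letter or digit, then join in PascalCase.
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", str(candidate)) if p]
    name = "".join(p[0].upper() + p[1:] for p in parts)
    # Guard against empty names, leading digits, and keywords such as "None".
    if not name or name[0].isdigit() or keyword.iskeyword(name):
        name = "Schema" + name
    return name


print(sketch_derive_class_name({"title": "police report"}, fallback="upload"))
print(sketch_derive_class_name({}, fallback="repairestimate"))
```

Deriving the name on the server keeps the class name a function of validated data rather than of free-form client input.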

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/ContentProcessorAPI/test_http/schema_API.http | Updates REST client examples to upload .json schema files. |
| src/ContentProcessorAPI/samples/schemas/schema_info.json | Points sample manifest to .json schema artifacts. |
| src/ContentProcessorAPI/samples/schemas/autoclaim.json | Adds JSON Schema version of the autoclaim sample. |
| src/ContentProcessorAPI/samples/schemas/damagedcarimage.json | Adds JSON Schema version of the damaged-car-image sample. |
| src/ContentProcessorAPI/samples/schemas/policereport.json | Adds JSON Schema version of the policereport sample. |
| src/ContentProcessorAPI/samples/schemas/repairestimate.json | Adds JSON Schema version of the repairestimate sample. |
| src/ContentProcessorAPI/samples/schemas/register_schema.py | Updates schema registration helper to handle .json + correct MIME types. |
| src/ContentProcessorAPI/requirements.txt | Adds jsonschema dependency for server-side JSON Schema validation. |
| src/ContentProcessorAPI/app/tests/routers/test_schemavault.py | Extends router tests to cover JSON upload/update paths and legacy .py acceptance. |
| src/ContentProcessorAPI/app/tests/logics/test_schema_validator.py | Adds unit tests for JSON schema validation and class-name derivation. |
| src/ContentProcessorAPI/app/routers/schemavault.py | Adds extension/size validation, JSON schema validation path, and Format/ContentType handling. |
| src/ContentProcessorAPI/app/routers/models/schmavault/model.py | Extends API Schema model with Format: Literal['python','json']. |
| src/ContentProcessorAPI/app/routers/logics/schemavault.py | Extends update logic to persist Format (and continues updating metadata). |
| src/ContentProcessorAPI/app/routers/logics/schema_validator.py | Introduces JSON Schema validator + extension keyword allowlist. |
| src/ContentProcessor/tests/unit/utils/test_remote_schema_loader.py | Adds unit tests for JSON-schema-to-Pydantic model translation and a golden sample check. |
| src/ContentProcessor/src/libs/utils/remote_schema_loader.py | Adds safe JSON-schema-based loader that builds Pydantic models without executing code. |
| src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py | Switches schema loading based on Schema.Format (json vs python). |
| src/ContentProcessor/src/libs/pipeline/entities/schema.py | Adds Format field to worker-side Schema entity. |
| src/ContentProcessor/requirements.txt | Adds jsonschema dependency to the worker requirements. |
| scripts/py_schema_to_json.py | Adds a local conversion helper from legacy .py Pydantic models to .json schema. |
| infra/scripts/post_deployment.sh | Uploads .json schemas with application/json and .py with text/x-python. |
| infra/scripts/post_deployment.ps1 | Same as above for PowerShell deployments. |
| docs/CustomizeSchemaData.md | Updates docs to recommend JSON Schema, documents workflow, and updates sample references. |


Comment on lines 87 to 90
schema_object.ClassName = class_name
schema_object.ContentType = file.content_type
schema_object.Format = storage_format
schema_object.Updated_On = result["date"]

Copilot AI Apr 28, 2026


Schemas.Update() assigns schema_object.ContentType = file.content_type, but UploadFile.content_type can be None (or inaccurate). Because Schema.ContentType is typed as str, this can result in null being written to Cosmos and later breaking schema reads/serialization. Use a deterministic fallback based on storage_format (e.g., application/json vs text/x-python) and/or preserve the existing ContentType when the upload omits it.
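
A minimal sketch of the fallback this comment asks for, assuming a hypothetical helper `resolve_content_type` (not part of the PR). It guarantees a string result even when the upload carries no content type.

```python
from typing import Optional

# Deterministic defaults keyed by the storage format named in the comment.
_DEFAULT_CONTENT_TYPES = {"json": "application/json", "python": "text/x-python"}


def resolve_content_type(
    uploaded: Optional[str],
    storage_format: str,
    existing: Optional[str] = None,
) -> str:
    """Never return None: use the upload's value, then the stored one, then a default."""
    return uploaded or existing or _DEFAULT_CONTENT_TYPES[storage_format]


# An upload with no content type falls back to the format default.
print(resolve_content_type(None, "json"))
# An upload with no content type keeps the value already stored on the record.
print(resolve_content_type(None, "python", existing="text/plain"))
```

This ordering still prefers the client-supplied value when it is present; a stricter variant would ignore it and derive the type from `storage_format` alone.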

Comment on lines +169 to +173
        content_type = file.content_type or "application/json"
    else:
        class_name = data.ClassName
        storage_format = "python"
        content_type = file.content_type or "text/x-python"

Copilot AI Apr 28, 2026


For .json uploads, content_type is taken from the client (file.content_type) when provided. This can store the wrong MIME type in Cosmos and later return an incorrect Content-Type on downloads, even though the server already knows the schema is JSON vs Python from the extension. Consider overriding to application/json for .json (and text/x-python for .py) rather than trusting the client header.

Suggested change
-         content_type = file.content_type or "application/json"
-     else:
-         class_name = data.ClassName
-         storage_format = "python"
-         content_type = file.content_type or "text/x-python"
+         content_type = "application/json"
+     else:
+         class_name = data.ClassName
+         storage_format = "python"
+         content_type = "text/x-python"

Comment on lines +224 to +244
      safe_filename, extension = _validate_upload(file)

      if extension == ".json":
          raw = file.file.read()
          file.file.seek(0)
          try:
              document = validate_json_schema(raw)
          except SchemaValidationError as exc:
              raise HTTPException(
                  status_code=400,
                  detail={"message": "Invalid JSON schema.", "errors": exc.errors},
              ) from exc
          fallback = os.path.splitext(safe_filename)[0]
          class_name = derive_class_name(document, fallback=data.ClassName or fallback)
          storage_format = "json"
      else:
          class_name = data.ClassName
          storage_format = "python"

      schemas: Schemas = app.app_context.get_service(Schemas)
-     return schemas.Update(file, data.SchemaId, data.ClassName)
+     return schemas.Update(file, data.SchemaId, class_name, storage_format)

Copilot AI Apr 28, 2026


Update_Schema validates the uploaded filename/extension to decide storage_format, but the underlying Schemas.Update() call cannot update FileName (and overwrites the existing blob by its old name). This can leave records where Format == "json" but FileName still ends with .py, which is confusing for downloads and tooling that infers format from filename. Either persist the sanitized filename (and handle blob rename/copy), or document/enforce that updates do not change filenames/extensions (e.g., require the uploaded extension to match the existing schema’s stored filename/format).
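
The "enforce" option could look roughly like the sketch below, which rejects an update whose extension differs from the stored filename. The names `ExtensionMismatchError` and `ensure_same_extension` are hypothetical and do not exist in the PR.

```python
import os


class ExtensionMismatchError(ValueError):
    """Raised when an update would change a schema's file extension."""


def ensure_same_extension(stored_filename: str, uploaded_filename: str) -> str:
    """Return the shared extension, or raise if the upload would change it."""
    stored_ext = os.path.splitext(stored_filename)[1].lower()
    uploaded_ext = os.path.splitext(uploaded_filename)[1].lower()
    if stored_ext != uploaded_ext:
        raise ExtensionMismatchError(
            f"stored schema is '{stored_ext}' but upload is '{uploaded_ext}'"
        )
    return uploaded_ext


# Same extension: the update may proceed.
print(ensure_same_extension("policereport.json", "policereport_v2.json"))
```

Rejecting the mismatch keeps Format and FileName consistent without needing a blob rename or copy, at the cost of requiring a new registration to change formats.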

idna==3.11
iniconfig==2.3.0
isodate==0.7.2
jsonschema==4.25.1

Copilot AI Apr 28, 2026


jsonschema is added to the ContentProcessor service requirements, but there are no imports/uses of the jsonschema package in the ContentProcessor codebase (the JSON Schema validation logic lives in the API). If the worker does not validate schemas, consider removing this dependency to reduce image size and supply-chain surface area, or add the corresponding worker-side usage.

Suggested change
jsonschema==4.25.1

Comment on lines +331 to +335
- Two project-specific extension keywords are accepted:
  - `x-cps-extract-prompt` — optional override for the LLM extraction
    prompt for that field.
  - `x-cps-required-on-save` — marks a field that must be present in
    the LLM output before persistence.

Copilot AI Apr 28, 2026


This section documents x-cps-extract-prompt and x-cps-required-on-save as having runtime effects (prompt override / persistence gating), but the current codebase only allowlists these keys during upload validation and does not read them anywhere in the pipeline. Either implement the described behavior or clarify in the docs that these keywords are currently validated/allowed but not yet consumed by the worker/runtime.

Suggested change
- Two project-specific extension keywords are accepted:
  - `x-cps-extract-prompt` — optional override for the LLM extraction
    prompt for that field.
  - `x-cps-required-on-save` — marks a field that must be present in
    the LLM output before persistence.

- Two project-specific extension keywords are currently accepted during
  upload validation:
  - `x-cps-extract-prompt` — allowed in the uploaded schema, but not
    currently consumed by the worker/runtime as a per-field prompt
    override.
  - `x-cps-required-on-save` — allowed in the uploaded schema, but not
    currently consumed by the worker/runtime to gate persistence.
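
For context on what "allowlisted during upload validation" can mean in practice, here is a minimal sketch of such a check. It is an assumption about the approach, not the code in schema_validator.py; `find_unknown_extensions` is a hypothetical name, and this simple walk would also flag a property that happens to be named with an `x-` prefix.

```python
# The two extension keywords named in the docs excerpt above.
ALLOWED_EXTENSION_KEYWORDS = {"x-cps-extract-prompt", "x-cps-required-on-save"}


def find_unknown_extensions(node, path="$"):
    """Return paths of `x-` keys not on the allowlist, walking the whole document."""
    unknown = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key.startswith("x-") and key not in ALLOWED_EXTENSION_KEYWORDS:
                unknown.append(child)
            unknown.extend(find_unknown_extensions(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            unknown.extend(find_unknown_extensions(item, f"{path}[{index}]"))
    return unknown


# Hypothetical schema fragment with one allowed and one unknown extension key.
document = {
    "type": "object",
    "properties": {
        "vin": {
            "type": "string",
            "x-cps-extract-prompt": "Read the VIN from the form.",
            "x-other": 1,
        }
    },
}
problems = find_unknown_extensions(document)
```

A check like this only decides whether the upload is accepted; it does not by itself give the keywords any runtime effect, which is the gap the comment points out.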

Container image was failing at import time with ModuleNotFoundError: 'jsonschema'.
The Dockerfile installs from uv.lock via 'uv sync --frozen', so requirements.txt alone was not enough; the dep had to land in pyproject.toml + uv.lock.

ContentProcessorAPI: adds jsonschema (+ specifications, referencing, rpds-py).
ContentProcessor: pins jsonschema to 4.25.1 (was a 4.26.0 transitive).
BREAKING CHANGE: schema vault no longer accepts Python (.py) schema files.

- API rejects .py uploads with HTTP 415; only .json (JSON Schema Draft 2020-12) is accepted.
- Worker (map_handler) refuses to process schemas with Format='python'; existing Cosmos records must be re-registered as JSON.
- Deleted libs/utils/remote_module_loader.py (the exec/importlib loader that was the original RCE primitive).
- Deleted sample .py schemas; .json equivalents have been the default since the previous commit.
- register_schema.py, post_deployment.sh/ps1, .http examples, and CustomizeSchemaData.md all updated to JSON-only.
- Schema model defaults Format to 'json'; API model Literal restricted to 'json' only.
- Test suite updated: previous .py-accepting tests now assert .py is rejected.
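
The two gates in this commit message (API rejects .py with HTTP 415, worker refuses Format='python') can be pictured with the sketch below. It is framework-free and illustrative only; `upload_status`, `require_json_format`, and `UnsupportedSchemaFormat` are hypothetical names, and returning 200 for accepted files is an assumption.

```python
import os

ACCEPTED_EXTENSIONS = {".json"}  # JSON-only after the breaking change


class UnsupportedSchemaFormat(RuntimeError):
    """Raised by the worker-side sketch for records that are not Format='json'."""


def upload_status(filename: str) -> int:
    """Return the HTTP status a JSON-only vault would answer with for this filename."""
    extension = os.path.splitext(filename)[1].lower()
    return 200 if extension in ACCEPTED_EXTENSIONS else 415


def require_json_format(schema_format: str) -> None:
    """Refuse any stored schema record whose Format is not 'json'."""
    if schema_format != "json":
        raise UnsupportedSchemaFormat(
            f"Format={schema_format!r} is no longer supported; re-register as JSON"
        )


print(upload_status("autoclaim.json"), upload_status("autoclaim.py"))
```

Checking at both layers means a legacy record that is already in Cosmos cannot reach a code-loading path even though it was never re-uploaded through the API.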

import json
import logging
from typing import Any, ForwardRef, List, Literal, Optional, Tuple, Type, Union