feat: Support JSON Schema uploads and update related deployment scripts#566
Prajwal-Microsoft wants to merge 5 commits into main
Conversation
Adds a parallel JSON Schema upload path so schemas can be authored as data instead of executable Python. The worker materialises Pydantic models from JSON in memory (no exec) via the new remote_schema_loader. Legacy .py uploads continue to work unchanged. M1 of the migration plan.
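The new `remote_schema_loader` itself is not shown in this view. As a rough sketch of the "materialise without exec" idea, a flat JSON Schema object can be turned into a Pydantic model with `create_model`; the type map, function name, and sample schema below are illustrative assumptions, not the PR's actual loader code:

```python
from typing import Any, Optional

from pydantic import BaseModel, create_model

# Minimal (illustrative) mapping from JSON Schema primitive types to Python types.
_TYPE_MAP: dict[str, type] = {
    "string": str,
    "integer": int,
    "number": float,
    "boolean": bool,
}


def model_from_json_schema(document: dict[str, Any]) -> type[BaseModel]:
    """Materialize a flat JSON Schema object as a Pydantic model -- no exec()."""
    required = set(document.get("required", []))
    fields: dict[str, Any] = {}
    for name, spec in document.get("properties", {}).items():
        py_type = _TYPE_MAP.get(spec.get("type", "string"), str)
        # Required fields get `...` (no default); optional fields default to None.
        fields[name] = (py_type, ...) if name in required else (Optional[py_type], None)
    return create_model(document.get("title", "Schema"), **fields)


# Example: a flat police-report-like schema (hypothetical, not the repo's sample).
Report = model_from_json_schema({
    "title": "PoliceReport",
    "type": "object",
    "properties": {"case_number": {"type": "string"}, "injuries": {"type": "integer"}},
    "required": ["case_number"],
})
```

Because the schema arrives as data and is only ever parsed, a malicious upload can at worst fail validation; there is no `exec`/`importlib` path to abuse.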
…_schema.py supports .json

- Adds damagedcarimage.json, policereport.json, repairestimate.json (generated via scripts/py_schema_to_json.py).
- register_schema.py now picks the correct content-type per extension (.py -> text/x-python, .json -> application/json).
- Manifest unchanged for now; flip to .json files when ready to deprecate the legacy Python path.
…json schemas

- schema_info.json manifest now lists *.json files (was *.py).
- post_deployment.sh and post_deployment.ps1 derive multipart Content-Type per file extension (.json -> application/json, .py -> text/x-python).
- test_http/schema_API.http examples updated to upload .json samples.
- docs/CustomizeSchemaData.md sample table, mermaid diagram, and manifest example refer to .json files.
- register_schema.py docstring example updated.

Legacy .py uploads still work end-to-end; the change just flips the default authored format.
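The per-extension content-type selection described in these commits can be sketched as a small helper (the function name `content_type_for` is illustrative; `register_schema.py` may structure this differently):

```python
import os

# Extension -> multipart Content-Type, per the commit message.
_CONTENT_TYPES = {
    ".py": "text/x-python",
    ".json": "application/json",
}


def content_type_for(filename: str) -> str:
    """Pick the upload Content-Type from the file extension."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return _CONTENT_TYPES[ext]
    except KeyError:
        raise ValueError(f"Unsupported schema extension: {ext!r}") from None
```

Note that the deployment scripts skip files with unsupported extensions rather than failing, so a real port of this helper might return `None` instead of raising.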
Pull request overview
This PR adds first-class support for registering and using JSON Schema (.json) schema artifacts alongside legacy executable Python (.py) schemas, updating the API, worker, deployment tooling, samples, and docs to enable a safer “data-only” schema workflow.
Changes:
- Add JSON Schema upload validation + class-name derivation in the Schema Vault API, and persist a new Format field (python|json) in schema metadata.
- Update the worker map handler to materialize JSON Schema into in-memory Pydantic models (no code execution) while preserving legacy Python loading.
- Refresh deployment/scripts/docs/samples/tests to use .json schemas by default and to upload the correct MIME types.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ContentProcessorAPI/test_http/schema_API.http | Updates REST client examples to upload .json schema files. |
| src/ContentProcessorAPI/samples/schemas/schema_info.json | Points sample manifest to .json schema artifacts. |
| src/ContentProcessorAPI/samples/schemas/autoclaim.json | Adds JSON Schema version of the autoclaim sample. |
| src/ContentProcessorAPI/samples/schemas/damagedcarimage.json | Adds JSON Schema version of the damaged-car-image sample. |
| src/ContentProcessorAPI/samples/schemas/policereport.json | Adds JSON Schema version of the policereport sample. |
| src/ContentProcessorAPI/samples/schemas/repairestimate.json | Adds JSON Schema version of the repairestimate sample. |
| src/ContentProcessorAPI/samples/schemas/register_schema.py | Updates schema registration helper to handle .json + correct MIME types. |
| src/ContentProcessorAPI/requirements.txt | Adds jsonschema dependency for server-side JSON Schema validation. |
| src/ContentProcessorAPI/app/tests/routers/test_schemavault.py | Extends router tests to cover JSON upload/update paths and legacy .py acceptance. |
| src/ContentProcessorAPI/app/tests/logics/test_schema_validator.py | Adds unit tests for JSON schema validation and class-name derivation. |
| src/ContentProcessorAPI/app/routers/schemavault.py | Adds extension/size validation, JSON schema validation path, and Format/ContentType handling. |
| src/ContentProcessorAPI/app/routers/models/schmavault/model.py | Extends API Schema model with Format: Literal['python','json']. |
| src/ContentProcessorAPI/app/routers/logics/schemavault.py | Extends update logic to persist Format (and continues updating metadata). |
| src/ContentProcessorAPI/app/routers/logics/schema_validator.py | Introduces JSON Schema validator + extension keyword allowlist. |
| src/ContentProcessor/tests/unit/utils/test_remote_schema_loader.py | Adds unit tests for JSON-schema-to-Pydantic model translation and a golden sample check. |
| src/ContentProcessor/src/libs/utils/remote_schema_loader.py | Adds safe JSON-schema-based loader that builds Pydantic models without executing code. |
| src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py | Switches schema loading based on Schema.Format (json vs python). |
| src/ContentProcessor/src/libs/pipeline/entities/schema.py | Adds Format field to worker-side Schema entity. |
| src/ContentProcessor/requirements.txt | Adds jsonschema dependency to the worker requirements. |
| scripts/py_schema_to_json.py | Adds a local conversion helper from legacy .py Pydantic models to .json schema. |
| infra/scripts/post_deployment.sh | Uploads .json schemas with application/json and .py with text/x-python. |
| infra/scripts/post_deployment.ps1 | Same as above for PowerShell deployments. |
| docs/CustomizeSchemaData.md | Updates docs to recommend JSON Schema, documents workflow, and updates sample references. |
```python
schema_object.ClassName = class_name
schema_object.ContentType = file.content_type
schema_object.Format = storage_format
schema_object.Updated_On = result["date"]
```
Schemas.Update() assigns schema_object.ContentType = file.content_type, but UploadFile.content_type can be None (or inaccurate). Because Schema.ContentType is typed as str, this can result in null being written to Cosmos and later breaking schema reads/serialization. Use a deterministic fallback based on storage_format (e.g., application/json vs text/x-python) and/or preserve the existing ContentType when the upload omits it.
```python
    content_type = file.content_type or "application/json"
else:
    class_name = data.ClassName
    storage_format = "python"
    content_type = file.content_type or "text/x-python"
```
For .json uploads, content_type is taken from the client (file.content_type) when provided. This can store the wrong MIME type in Cosmos and later return an incorrect Content-Type on downloads, even though the server already knows the schema is JSON vs Python from the extension. Consider overriding to application/json for .json (and text/x-python for .py) rather than trusting the client header.
Suggested change:

```diff
-    content_type = file.content_type or "application/json"
-else:
-    class_name = data.ClassName
-    storage_format = "python"
-    content_type = file.content_type or "text/x-python"
+    content_type = "application/json"
+else:
+    class_name = data.ClassName
+    storage_format = "python"
+    content_type = "text/x-python"
```
```diff
 safe_filename, extension = _validate_upload(file)

 if extension == ".json":
     raw = file.file.read()
     file.file.seek(0)
     try:
         document = validate_json_schema(raw)
     except SchemaValidationError as exc:
         raise HTTPException(
             status_code=400,
             detail={"message": "Invalid JSON schema.", "errors": exc.errors},
         ) from exc
     fallback = os.path.splitext(safe_filename)[0]
     class_name = derive_class_name(document, fallback=data.ClassName or fallback)
     storage_format = "json"
 else:
     class_name = data.ClassName
     storage_format = "python"

 schemas: Schemas = app.app_context.get_service(Schemas)
-return schemas.Update(file, data.SchemaId, data.ClassName)
+return schemas.Update(file, data.SchemaId, class_name, storage_format)
```
Update_Schema validates the uploaded filename/extension to decide storage_format, but the underlying Schemas.Update() call cannot update FileName (and overwrites the existing blob by its old name). This can leave records where Format == "json" but FileName still ends with .py, which is confusing for downloads and tooling that infers format from filename. Either persist the sanitized filename (and handle blob rename/copy), or document/enforce that updates do not change filenames/extensions (e.g., require the uploaded extension to match the existing schema’s stored filename/format).
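One way to enforce the second option from this comment (require the uploaded extension to match the stored artifact) is a small pre-check before calling `Schemas.Update()`; the `StoredSchema` type and function name here are illustrative, not the repo's:

```python
import os
from typing import NamedTuple


class StoredSchema(NamedTuple):
    """Illustrative stand-in for the persisted schema record."""
    file_name: str
    format: str  # "python" or "json"


def check_update_extension(stored: StoredSchema, uploaded_filename: str) -> None:
    """Reject updates whose extension disagrees with the stored artifact,
    so Format and FileName cannot drift apart."""
    old_ext = os.path.splitext(stored.file_name)[1].lower()
    new_ext = os.path.splitext(uploaded_filename)[1].lower()
    if old_ext != new_ext:
        raise ValueError(
            f"Upload is {new_ext!r} but the stored schema is {old_ext!r}; "
            "re-register the schema instead of updating it in place."
        )
```

The alternative (persisting the sanitized filename and renaming the blob) is more invasive, since blob storage generally requires a copy-then-delete to "rename".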
```text
idna==3.11
iniconfig==2.3.0
isodate==0.7.2
jsonschema==4.25.1
```
jsonschema is added to the ContentProcessor service requirements, but there are no imports/uses of the jsonschema package in the ContentProcessor codebase (the JSON Schema validation logic lives in the API). If the worker does not validate schemas, consider removing this dependency to reduce image size and supply-chain surface area, or add the corresponding worker-side usage.
Suggested change:

```diff
-jsonschema==4.25.1
```
```markdown
- Two project-specific extension keywords are accepted:
  - `x-cps-extract-prompt` — optional override for the LLM extraction
    prompt for that field.
  - `x-cps-required-on-save` — marks a field that must be present in
    the LLM output before persistence.
```
This section documents x-cps-extract-prompt and x-cps-required-on-save as having runtime effects (prompt override / persistence gating), but the current codebase only allowlists these keys during upload validation and does not read them anywhere in the pipeline. Either implement the described behavior or clarify in the docs that these keywords are currently validated/allowed but not yet consumed by the worker/runtime.
Suggested change:

```diff
-- Two project-specific extension keywords are accepted:
-  - `x-cps-extract-prompt` — optional override for the LLM extraction
-    prompt for that field.
-  - `x-cps-required-on-save` — marks a field that must be present in
-    the LLM output before persistence.
+- Two project-specific extension keywords are currently accepted during
+  upload validation:
+  - `x-cps-extract-prompt` — allowed in the uploaded schema, but not
+    currently consumed by the worker/runtime as a per-field prompt
+    override.
+  - `x-cps-required-on-save` — allowed in the uploaded schema, but not
+    currently consumed by the worker/runtime to gate persistence.
```
Container image was failing at import time with ModuleNotFoundError: 'jsonschema'. The Dockerfile installs from uv.lock via 'uv sync --frozen', so requirements.txt alone was not enough; the dep had to land in pyproject.toml + uv.lock. ContentProcessorAPI: adds jsonschema (+ specifications, referencing, rpds-py). ContentProcessor: pins jsonschema to 4.25.1 (was a 4.26.0 transitive).
BREAKING CHANGE: schema vault no longer accepts Python (.py) schema files.

- API rejects .py uploads with HTTP 415; only .json (JSON Schema Draft 2020-12) is accepted.
- Worker (map_handler) refuses to process schemas with Format='python'; existing Cosmos records must be re-registered as JSON.
- Deleted libs/utils/remote_module_loader.py (the exec/importlib loader that was the original RCE primitive).
- Deleted sample .py schemas; .json equivalents have been the default since the previous commit.
- register_schema.py, post_deployment.sh/ps1, .http examples, and CustomizeSchemaData.md all updated to JSON-only.
- Schema model defaults Format to 'json'; API model Literal restricted to 'json' only.
- Test suite updated: previous .py-accepting tests now assert .py is rejected.
```python
import json
import logging
from typing import Any, ForwardRef, List, Literal, Optional, Tuple, Type, Union
```
Purpose
This pull request introduces support for authoring and managing schemas as JSON Schema documents in addition to the legacy Python (.py) format. The changes improve security by allowing schemas to be uploaded as pure data (JSON), which is parsed and materialized in memory without executing any uploaded code. The documentation, deployment scripts, and backend logic have been updated to support this new recommended workflow, while maintaining backward compatibility with existing Python-based schemas.

Schema authoring and documentation improvements:

- Updated docs/CustomizeSchemaData.md to recommend JSON Schema as the preferred format, explain its advantages, and document the new workflow for converting and uploading schemas. Added a comparison table, migration instructions, and API usage details. [1] [2] [3] [4]
- Added scripts/py_schema_to_json.py to convert legacy Pydantic .py schemas into JSON Schema documents for safer uploads.
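The conversion that script performs can be approximated with Pydantic v2's built-in JSON Schema export; the model below is an illustrative stand-in, not one of the repo's samples, and the real script may post-process the output:

```python
import json
from typing import Optional

from pydantic import BaseModel


class RepairEstimate(BaseModel):
    """Stand-in for a legacy .py schema class (illustrative only)."""
    vehicle_vin: str
    total_cost: Optional[float] = None


# Pydantic v2 emits a JSON Schema document for the model; this data-only
# artifact is what the schema vault now accepts instead of the .py file.
document = RepairEstimate.model_json_schema()
print(json.dumps(document, indent=2))
```

Writing that dump to `repairestimate.json` yields an upload-ready artifact with no executable content.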
Deployment and registration workflow:

- Updated deployment scripts (post_deployment.ps1 and post_deployment.sh) to detect schema file extensions and set the appropriate MIME type when uploading schemas (application/json for .json, text/x-python for .py). Unsupported extensions are skipped. [1] [2] [3]

Backend and data model changes:

- Updated the Schema model in src/libs/pipeline/entities/schema.py to include a new Format field, distinguishing between "python" (legacy) and "json" (recommended) schema formats. [1] [2]
- Updated the worker (map_handler.py) to load and materialize schemas differently depending on their format: JSON schemas are parsed as data, while Python schemas are imported and executed as before. [1] [2]

Dependencies:

- Added jsonschema to requirements.txt to support JSON Schema validation and handling.

Does this introduce a breaking change?
Golden Path Validation
Deployment Validation
What to Check
Verify that the following are valid
Other Information