feat: Support JSON Schema uploads and update related deployment scripts by Prajwal-Microsoft · Pull Request #566 · microsoft/content-processing-solution-accelerator

Prajwal-Microsoft · 2026-04-28T04:49:27Z

Purpose

This pull request introduces support for authoring and managing schemas as JSON Schema documents in addition to the legacy Python (.py) format. The changes improve security by allowing schemas to be uploaded as pure data (JSON), which are parsed and materialized in memory without executing any uploaded code. The documentation, deployment scripts, and backend logic have been updated to support this new recommended workflow, while maintaining backward compatibility with existing Python-based schemas.

Schema authoring and documentation improvements:

Updated docs/CustomizeSchemaData.md to recommend JSON Schema as the preferred format, explain its advantages, and document the new workflow for converting and uploading schemas. Added a comparison table, migration instructions, and API usage details. [1] [2] [3] [4]
Added a new script scripts/py_schema_to_json.py to convert legacy Pydantic .py schemas into JSON Schema documents for safer uploads.

Deployment and registration workflow:

Modified deployment scripts (post_deployment.ps1 and post_deployment.sh) to detect schema file extensions and set the appropriate MIME type when uploading schemas (application/json for .json, text/x-python for .py). Unsupported extensions are skipped. [1] [2] [3]
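
The extension-to-MIME selection described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the code in post_deployment.sh or post_deployment.ps1: the names `SCHEMA_MIME_TYPES` and `content_type_for` are hypothetical, and lower-casing the extension is an assumption the PR does not confirm.

```python
from pathlib import Path
from typing import Optional

# Mapping taken from the PR description; the dictionary name is hypothetical.
SCHEMA_MIME_TYPES = {
    ".json": "application/json",
    ".py": "text/x-python",
}


def content_type_for(schema_path: str) -> Optional[str]:
    """Return the multipart Content-Type for a schema file, or None to skip it."""
    # Assumption: extensions are compared case-insensitively.
    extension = Path(schema_path).suffix.lower()
    return SCHEMA_MIME_TYPES.get(extension)


for name in ["autoclaim.json", "policereport.py", "notes.txt"]:
    mime = content_type_for(name)
    if mime is None:
        print(f"skipping {name}: unsupported extension")
    else:
        print(f"uploading {name} as {mime}")
```

Returning `None` for unknown extensions keeps the "skip unsupported files" decision with the caller, which mirrors how the scripts are described.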

Backend and data model changes:

Extended the Schema model in src/libs/pipeline/entities/schema.py to include a new Format field, distinguishing between "python" (legacy) and "json" (recommended) schema formats. [1] [2]
Updated the map handler (map_handler.py) to load and materialize schemas differently depending on their format: JSON schemas are parsed as data, while Python schemas are imported and executed as before. [1] [2]

Dependencies:

Added jsonschema to requirements.txt to support JSON Schema validation and handling.

Does this introduce a breaking change?

Yes
No

Golden Path Validation

I have tested the primary workflows (the "golden path") to ensure they function correctly without errors.

Deployment Validation

I have validated the deployment process successfully and all services are running as expected with this change.

What to Check

Verify that the following are valid

...

Other Information

Adds a parallel JSON Schema upload path so schemas can be authored as data instead of executable Python. The worker materialises Pydantic models from JSON in memory (no exec) via the new remote_schema_loader. Legacy .py uploads continue to work unchanged. M1 of the migration plan.
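
To illustrate what "materialise from JSON in memory (no exec)" can mean, here is a small sketch. It is not the PR's remote_schema_loader: it swaps Pydantic for the standard library's `dataclasses.make_dataclass` so the example is self-contained, it handles only flat objects with primitive types, and `build_record_class` and `_JSON_TO_PY` are hypothetical names.

```python
import json
from dataclasses import field, fields, make_dataclass
from typing import Any, Optional

# Hypothetical, deliberately small type table; real JSON Schema has many more cases.
_JSON_TO_PY = {"string": str, "integer": int, "number": float, "boolean": bool}


def build_record_class(schema_text: str, class_name: str) -> type:
    """Build a dataclass from a flat JSON Schema object using data only (no exec/import)."""
    document = json.loads(schema_text)
    required = set(document.get("required", []))
    required_fields, optional_fields = [], []
    for prop_name, prop in document.get("properties", {}).items():
        json_type = prop.get("type")
        # "type" may also be a list in JSON Schema; this sketch falls back to Any.
        py_type = _JSON_TO_PY.get(json_type, Any) if isinstance(json_type, str) else Any
        if prop_name in required:
            required_fields.append((prop_name, py_type))
        else:
            # Optional properties default to None.
            optional_fields.append((prop_name, Optional[py_type], field(default=None)))
    # Fields without defaults must come before fields with defaults.
    return make_dataclass(class_name, required_fields + optional_fields)


SAMPLE = json.dumps({
    "type": "object",
    "properties": {"claim_id": {"type": "string"}, "amount": {"type": "number"}},
    "required": ["claim_id"],
})
Claim = build_record_class(SAMPLE, "Claim")
claim = Claim(claim_id="C-1")
print(claim)
```

The uploaded document is only ever passed to `json.loads`, so nothing in it can run as code; property names that are not valid Python identifiers would make `make_dataclass` raise, which a real loader would need to handle.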

…_schema.py supports .json
- Adds damagedcarimage.json, policereport.json, repairestimate.json (generated via scripts/py_schema_to_json.py).
- register_schema.py now picks the correct content-type per extension (.py -> text/x-python, .json -> application/json).
- Manifest unchanged for now; flip to .json files when ready to deprecate the legacy Python path.

…json schemas
- schema_info.json manifest now lists *.json files (was *.py).
- post_deployment.sh and post_deployment.ps1 derive multipart Content-Type per file extension (.json -> application/json, .py -> text/x-python).
- test_http/schema_API.http examples updated to upload .json samples.
- docs/CustomizeSchemaData.md sample table, mermaid diagram, and manifest example refer to .json files.
- register_schema.py docstring example updated.

Legacy .py uploads still work end-to-end; the change just flips the default authored format.

Copilot

Pull request overview

This PR adds first-class support for registering and using JSON Schema (.json) schema artifacts alongside legacy executable Python (.py) schemas, updating the API, worker, deployment tooling, samples, and docs to enable a safer “data-only” schema workflow.

Changes:

Add JSON Schema upload validation + class-name derivation in the Schema Vault API, and persist a new Format field (python | json) in schema metadata.
Update the worker map handler to materialize JSON Schema into in-memory Pydantic models (no code execution) while preserving legacy Python loading.
Refresh deployment/scripts/docs/samples/tests to use .json schemas by default and to upload the correct MIME types.
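
The review comments show the call `derive_class_name(document, fallback=...)` but not its body. The sketch below is one plausible shape for such a helper and should be read as an assumption throughout: using the schema's `title` keyword, converting it to PascalCase, and prefixing names that start with a digit are illustrative choices, not behavior confirmed by the PR.

```python
import re
from typing import Any, Mapping


def derive_class_name(document: Mapping[str, Any], fallback: str) -> str:
    """Pick an identifier-safe class name from the schema title, else the fallback."""
    title = document.get("title")
    candidate = title if isinstance(title, str) and title.strip() else fallback
    # Split on anything that is not an ASCII letter or digit, then join in PascalCase.
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", candidate) if p]
    name = "".join(p[0].upper() + p[1:] for p in parts)
    if not name:
        raise ValueError("cannot derive a class name from title or fallback")
    # Python identifiers may not start with a digit.
    return name if not name[0].isdigit() else f"Schema{name}"


print(derive_class_name({"title": "police report"}, fallback="policereport"))
print(derive_class_name({}, fallback="repair-estimate"))
```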

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/ContentProcessorAPI/test_http/schema_API.http | Updates REST client examples to upload `.json` schema files. |
| src/ContentProcessorAPI/samples/schemas/schema_info.json | Points sample manifest to `.json` schema artifacts. |
| src/ContentProcessorAPI/samples/schemas/autoclaim.json | Adds JSON Schema version of the autoclaim sample. |
| src/ContentProcessorAPI/samples/schemas/damagedcarimage.json | Adds JSON Schema version of the damaged-car-image sample. |
| src/ContentProcessorAPI/samples/schemas/policereport.json | Adds JSON Schema version of the policereport sample. |
| src/ContentProcessorAPI/samples/schemas/repairestimate.json | Adds JSON Schema version of the repairestimate sample. |
| src/ContentProcessorAPI/samples/schemas/register_schema.py | Updates schema registration helper to handle `.json` + correct MIME types. |
| src/ContentProcessorAPI/requirements.txt | Adds `jsonschema` dependency for server-side JSON Schema validation. |
| src/ContentProcessorAPI/app/tests/routers/test_schemavault.py | Extends router tests to cover JSON upload/update paths and legacy `.py` acceptance. |
| src/ContentProcessorAPI/app/tests/logics/test_schema_validator.py | Adds unit tests for JSON schema validation and class-name derivation. |
| src/ContentProcessorAPI/app/routers/schemavault.py | Adds extension/size validation, JSON schema validation path, and Format/ContentType handling. |
| src/ContentProcessorAPI/app/routers/models/schmavault/model.py | Extends API Schema model with `Format: Literal['python','json']`. |
| src/ContentProcessorAPI/app/routers/logics/schemavault.py | Extends update logic to persist `Format` (and continues updating metadata). |
| src/ContentProcessorAPI/app/routers/logics/schema_validator.py | Introduces JSON Schema validator + extension keyword allowlist. |
| src/ContentProcessor/tests/unit/utils/test_remote_schema_loader.py | Adds unit tests for JSON-schema-to-Pydantic model translation and a golden sample check. |
| src/ContentProcessor/src/libs/utils/remote_schema_loader.py | Adds safe JSON-schema-based loader that builds Pydantic models without executing code. |
| src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py | Switches schema loading based on `Schema.Format` (`json` vs `python`). |
| src/ContentProcessor/src/libs/pipeline/entities/schema.py | Adds `Format` field to worker-side Schema entity. |
| src/ContentProcessor/requirements.txt | Adds `jsonschema` dependency to the worker requirements. |
| scripts/py_schema_to_json.py | Adds a local conversion helper from legacy `.py` Pydantic models to `.json` schema. |
| infra/scripts/post_deployment.sh | Uploads `.json` schemas with `application/json` and `.py` with `text/x-python`. |
| infra/scripts/post_deployment.ps1 | Same as above for PowerShell deployments. |
| docs/CustomizeSchemaData.md | Updates docs to recommend JSON Schema, documents workflow, and updates sample references. |

Copilot · 2026-04-28T05:57:26Z

        schema_object.ClassName = class_name
        schema_object.ContentType = file.content_type
+        schema_object.Format = storage_format
        schema_object.Updated_On = result["date"]


Schemas.Update() assigns schema_object.ContentType = file.content_type, but UploadFile.content_type can be None (or inaccurate). Because Schema.ContentType is typed as str, this can result in null being written to Cosmos and later breaking schema reads/serialization. Use a deterministic fallback based on storage_format (e.g., application/json vs text/x-python) and/or preserve the existing ContentType when the upload omits it.
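
One way to realize the fallback this comment asks for is sketched below. It is an illustration only: `resolve_content_type` and `_DEFAULT_CONTENT_TYPES` are hypothetical names, and the precedence shown (uploaded value, then the stored value, then a per-format default) is one reading of the "and/or" in the suggestion.

```python
from typing import Optional

# Canonical MIME types per storage format, as named in the review comment.
_DEFAULT_CONTENT_TYPES = {"json": "application/json", "python": "text/x-python"}


def resolve_content_type(
    storage_format: str,
    uploaded: Optional[str],
    existing: Optional[str] = None,
) -> str:
    """Never return None: prefer the upload's value, then the stored one, then a default."""
    for candidate in (uploaded, existing):
        if candidate and candidate.strip():
            return candidate.strip()
    # Raises KeyError for an unknown storage_format, which surfaces bad input early.
    return _DEFAULT_CONTENT_TYPES[storage_format]


print(resolve_content_type("json", None))
print(resolve_content_type("python", "", existing="text/x-python"))
```

Because the function always returns a `str`, a field typed as `str` can no longer end up holding `None`.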

Copilot · 2026-04-28T05:57:26Z

+        content_type = file.content_type or "application/json"
+    else:
+        class_name = data.ClassName
+        storage_format = "python"
+        content_type = file.content_type or "text/x-python"


For .json uploads, content_type is taken from the client (file.content_type) when provided. This can store the wrong MIME type in Cosmos and later return an incorrect Content-Type on downloads, even though the server already knows the schema is JSON vs Python from the extension. Consider overriding to application/json for .json (and text/x-python for .py) rather than trusting the client header.

Suggested change

-        content_type = file.content_type or "application/json"
-    else:
-        class_name = data.ClassName
-        storage_format = "python"
-        content_type = file.content_type or "text/x-python"
+        content_type = "application/json"
+    else:
+        class_name = data.ClassName
+        storage_format = "python"
+        content_type = "text/x-python"

Copilot · 2026-04-28T05:57:27Z

+    safe_filename, extension = _validate_upload(file)
+
+    if extension == ".json":
+        raw = file.file.read()
+        file.file.seek(0)
+        try:
+            document = validate_json_schema(raw)
+        except SchemaValidationError as exc:
+            raise HTTPException(
+                status_code=400,
+                detail={"message": "Invalid JSON schema.", "errors": exc.errors},
+            ) from exc
+        fallback = os.path.splitext(safe_filename)[0]
+        class_name = derive_class_name(document, fallback=data.ClassName or fallback)
+        storage_format = "json"
+    else:
+        class_name = data.ClassName
+        storage_format = "python"

    schemas: Schemas = app.app_context.get_service(Schemas)
-    return schemas.Update(file, data.SchemaId, data.ClassName)
+    return schemas.Update(file, data.SchemaId, class_name, storage_format)


Update_Schema validates the uploaded filename/extension to decide storage_format, but the underlying Schemas.Update() call cannot update FileName (and overwrites the existing blob by its old name). This can leave records where Format == "json" but FileName still ends with .py, which is confusing for downloads and tooling that infers format from filename. Either persist the sanitized filename (and handle blob rename/copy), or document/enforce that updates do not change filenames/extensions (e.g., require the uploaded extension to match the existing schema’s stored filename/format).
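
The second option in this comment, requiring the uploaded extension to match the stored one, could look roughly like the following. This is a sketch under assumptions: `ensure_same_extension` and `ExtensionMismatchError` are hypothetical names, and how the stored filename is looked up is left out.

```python
import os


class ExtensionMismatchError(ValueError):
    """Raised when an update would change a schema's file extension."""


def ensure_same_extension(stored_filename: str, uploaded_filename: str) -> str:
    """Return the shared extension, or raise if the update would change it."""
    stored_ext = os.path.splitext(stored_filename)[1].lower()
    uploaded_ext = os.path.splitext(uploaded_filename)[1].lower()
    if stored_ext != uploaded_ext:
        raise ExtensionMismatchError(
            f"stored schema is '{stored_ext}' but upload is '{uploaded_ext}'"
        )
    return stored_ext


print(ensure_same_extension("autoclaim.json", "autoclaim_v2.json"))
```

Running such a check before `Schemas.Update()` would prevent records where `Format` and the `FileName` extension disagree, at the cost of forcing a delete and re-register to change formats.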

Copilot · 2026-04-28T05:57:27Z

 idna==3.11
 iniconfig==2.3.0
 isodate==0.7.2
+jsonschema==4.25.1


jsonschema is added to the ContentProcessor service requirements, but there are no imports/uses of the jsonschema package in the ContentProcessor codebase (the JSON Schema validation logic lives in the API). If the worker does not validate schemas, consider removing this dependency to reduce image size and supply-chain surface area, or add the corresponding worker-side usage.

Suggested change

-jsonschema==4.25.1

Copilot · 2026-04-28T05:57:27Z

+- Two project-specific extension keywords are accepted:
+  - `x-cps-extract-prompt` — optional override for the LLM extraction
+    prompt for that field.
+  - `x-cps-required-on-save` — marks a field that must be present in
+    the LLM output before persistence.


This section documents x-cps-extract-prompt and x-cps-required-on-save as having runtime effects (prompt override / persistence gating), but the current codebase only allowlists these keys during upload validation and does not read them anywhere in the pipeline. Either implement the described behavior or clarify in the docs that these keywords are currently validated/allowed but not yet consumed by the worker/runtime.

Suggested change

-- Two project-specific extension keywords are accepted:
-  - `x-cps-extract-prompt` — optional override for the LLM extraction
-    prompt for that field.
-  - `x-cps-required-on-save` — marks a field that must be present in
-    the LLM output before persistence.
+- Two project-specific extension keywords are currently accepted during
+  upload validation:
+  - `x-cps-extract-prompt` — allowed in the uploaded schema, but not
+    currently consumed by the worker/runtime as a per-field prompt
+    override.
+  - `x-cps-required-on-save` — allowed in the uploaded schema, but not
+    currently consumed by the worker/runtime to gate persistence.
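
For context, allowlisting extension keywords at upload time can be sketched as a walk over the parsed document. This is not the PR's schema_validator: the function names are hypothetical, and treating every key that starts with `x-` as an extension keyword is an assumption (a property that happens to be named `x-...` would also be flagged by this simple walk).

```python
from typing import Any, Iterator, List, Tuple

# The two keywords named in the docs excerpt above.
ALLOWED_EXTENSION_KEYWORDS = frozenset({"x-cps-extract-prompt", "x-cps-required-on-save"})


def _walk_keys(node: Any, path: str = "$") -> Iterator[Tuple[str, str]]:
    """Yield (parent_path, key) for every object key anywhere in the document."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield path, key
            yield from _walk_keys(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from _walk_keys(item, f"{path}[{index}]")


def find_unknown_extension_keywords(document: Any) -> List[str]:
    """List 'x-' keys that are not on the allowlist, with their location."""
    return [
        f"{path}.{key}"
        for path, key in _walk_keys(document)
        if key.startswith("x-") and key not in ALLOWED_EXTENSION_KEYWORDS
    ]


schema = {
    "type": "object",
    "properties": {
        "vin": {"type": "string", "x-cps-extract-prompt": "Read the VIN.", "x-other": 1}
    },
}
print(find_unknown_extension_keywords(schema))
```

A check like this only decides whether a keyword may appear in an upload; as the comment notes, it says nothing about whether the worker later reads it.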

Container image was failing at import time with ModuleNotFoundError: 'jsonschema'. The Dockerfile installs from uv.lock via 'uv sync --frozen', so requirements.txt alone was not enough; the dep had to land in pyproject.toml + uv.lock. ContentProcessorAPI: adds jsonschema (+ specifications, referencing, rpds-py). ContentProcessor: pins jsonschema to 4.25.1 (was a 4.26.0 transitive).

BREAKING CHANGE: schema vault no longer accepts Python (.py) schema files.
- API rejects .py uploads with HTTP 415; only .json (JSON Schema Draft 2020-12) is accepted.
- Worker (map_handler) refuses to process schemas with Format='python'; existing Cosmos records must be re-registered as JSON.
- Deleted libs/utils/remote_module_loader.py (the exec/importlib loader that was the original RCE primitive).
- Deleted sample .py schemas; .json equivalents have been the default since the previous commit.
- register_schema.py, post_deployment.sh/ps1, .http examples, and CustomizeSchemaData.md all updated to JSON-only.
- Schema model defaults Format to 'json'; API model Literal restricted to 'json' only.
- Test suite updated: previous .py-accepting tests now assert .py is rejected.
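
The JSON-only upload rule from this commit can be sketched without a web framework. Everything below is illustrative: `validate_upload`, `UploadRejected`, and `MAX_SCHEMA_BYTES` are hypothetical names, the 1 MiB limit is invented for the example, and using 413 for oversized files is an assumption (only the 415 for .py uploads comes from the commit message).

```python
import os

MAX_SCHEMA_BYTES = 1024 * 1024  # hypothetical limit; the PR does not state one


class UploadRejected(Exception):
    """Carries the HTTP status the API layer would return."""

    def __init__(self, status_code: int, message: str) -> None:
        super().__init__(message)
        self.status_code = status_code


def validate_upload(filename: str, size_bytes: int) -> str:
    """Return a sanitized filename, or raise UploadRejected."""
    # Drop any directory components a client may have sent.
    safe_name = os.path.basename(filename.replace("\\", "/"))
    extension = os.path.splitext(safe_name)[1].lower()
    if extension != ".json":
        raise UploadRejected(415, f"unsupported schema type '{extension}'; only .json is accepted")
    if size_bytes > MAX_SCHEMA_BYTES:
        raise UploadRejected(413, "schema file is too large")
    return safe_name


for candidate in ["samples/autoclaim.json", "autoclaim.py"]:
    try:
        print("accepted", validate_upload(candidate, 2048))
    except UploadRejected as exc:
        print("rejected", exc.status_code, exc)
```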

+
+import json
+import logging
+from typing import Any, ForwardRef, List, Literal, Optional, Tuple, Type, Union


JSON Schema Migration added 3 commits April 28, 2026 10:05

Prajwal-Microsoft requested review from Avijit-Microsoft, Roopan-Microsoft, Vinay-Microsoft, aniaroramsft, dgp10801, nchandhi and toherman-msft as code owners April 28, 2026 04:49

Prajwal-Microsoft temporarily deployed to production April 28, 2026 04:49 — with GitHub Actions Inactive

github-code-quality Bot found potential problems Apr 28, 2026

Comment thread src/ContentProcessor/src/libs/utils/remote_schema_loader.py Fixed

Prajwal-Microsoft temporarily deployed to production April 28, 2026 05:40 — with GitHub Actions Inactive

Prajwal-Microsoft temporarily deployed to production April 28, 2026 05:45 — with GitHub Actions Inactive

Prajwal-Microsoft requested a review from Copilot April 28, 2026 05:49

Copilot started reviewing on behalf of Prajwal-Microsoft April 28, 2026 05:50

Copilot AI reviewed Apr 28, 2026

Prajwal-Microsoft had a problem deploying to production April 28, 2026 06:36 — with GitHub Actions Failure

Prajwal-Microsoft temporarily deployed to production April 28, 2026 06:39 — with GitHub Actions Inactive

Prajwal-Microsoft temporarily deployed to production April 28, 2026 06:53 — with GitHub Actions Inactive

Prajwal-Microsoft temporarily deployed to production April 28, 2026 06:57 — with GitHub Actions Inactive

Prajwal-Microsoft temporarily deployed to production April 28, 2026 06:58 — with GitHub Actions Inactive

Prajwal-Microsoft temporarily deployed to production April 28, 2026 07:10 — with GitHub Actions Inactive

Prajwal-Microsoft temporarily deployed to production April 28, 2026 07:29 — with GitHub Actions Inactive

github-code-quality Bot found potential problems Apr 28, 2026

Comment thread src/ContentProcessor/src/libs/utils/remote_schema_loader.py

Prajwal-Microsoft temporarily deployed to production April 28, 2026 07:41 — with GitHub Actions Inactive

Prekshith-Microsoft temporarily deployed to production April 28, 2026 11:33 — with GitHub Actions Inactive

Prekshith-Microsoft temporarily deployed to production April 28, 2026 11:38 — with GitHub Actions Inactive

Prekshith-Microsoft temporarily deployed to production April 28, 2026 12:03 — with GitHub Actions Inactive

Prekshith-Microsoft temporarily deployed to production April 29, 2026 05:40 — with GitHub Actions Inactive

Prekshith-Microsoft temporarily deployed to production April 29, 2026 05:46 — with GitHub Actions Inactive

Prekshith-Microsoft temporarily deployed to production April 29, 2026 05:56 — with GitHub Actions Inactive

Prajwal-Microsoft wants to merge 5 commits into main from feature/json-schema-support

3 participants
