Demo Video • Research Paper • Quick Start
Note: This demo video is from the use case with DigitalGreen.org where MMCT is being used. The app code shown in the video is not available in this repository.
MMCTAgent is a state-of-the-art multi-modal AI framework that brings human-like critical thinking to visual reasoning tasks. It combines advanced planning, self-critique, and tool-based reasoning to deliver superior performance in complex image and video understanding applications.
- Self-Reflection Framework: MMCTAgent iteratively analyzes multi-modal information, decomposes complex queries, plans strategies, and dynamically evolves its reasoning. Designed as a research framework, it integrates critical-thinking elements such as final-answer verification and self-reflection through a novel approach that defines a vision-based critic and identifies task-specific evaluation criteria, thereby enhancing its decision-making abilities.
- Querying over Multimodal Collections: Its modular design lets you plug in the right audio and visual extraction and processing tools, combined with multimodal LLMs, to ingest and query large collections of video and image data.
- Easy Integration: The modular design allows easy integration into existing workflows and the addition of domain-specific tools, facilitating adoption across domains that require advanced visual reasoning capabilities.
- Dual Orchestration for Video:
  - Graph Agent (Swarm): An autonomous multi-agent team (Planner, Video, Image) that agentically traverses knowledge graphs for broad discovery and reasoning.
  - Graph State (Pipeline): A deterministic state-machine workflow designed for high-precision, reproducible, and efficient retrieval.
- Agentic Image Reasoning: A dedicated multi-agent image pipeline featuring specialized tools for OCR, object detection, scene recognition, and ViT-based reasoning.
- Structured Video Ingestion: An end-to-end pipeline that transforms raw video into a queryable temporal knowledge graph (Neo4j), with automated scene segmentation and metadata extraction.
- Provider Agnostic: Native support for Azure OpenAI, Neo4j, and Azure Blob Storage, with a modular interface for custom LLM and VectorDB backends.
- MCP Native: Built-in Model Context Protocol (MCP) server support for seamless integration with external AI clients.
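The plan–critique loop behind the self-reflection framework can be sketched in miniature. This is a purely illustrative toy with made-up names, not the real agents in `mmct/`: a critic checks an answer against task-specific evaluation criteria, and the agent revises until the critic is satisfied.

```python
def critic(answer: str, criteria: list[str]) -> list[str]:
    """Return the evaluation criteria the answer fails to satisfy (toy check)."""
    return [c for c in criteria if c not in answer]

def agent_loop(draft: str, criteria: list[str], max_rounds: int = 3) -> str:
    """Iteratively revise a draft answer until the critic raises no failures."""
    answer = draft
    for _ in range(max_rounds):
        failures = critic(answer, criteria)
        if not failures:
            break
        # Self-reflection step: revise the answer to address each failed criterion
        answer += " " + " ".join(failures)
    return answer

print(agent_loop("The bottle label reads", ["dosage", "expiry"]))
```

In MMCTAgent proper, the critic is vision-based and the criteria are identified per task; the toy above only shows the control flow.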
To get started with MMCTAgent, clone the repository and install the dependencies:

```bash
git clone https://github.com/microsoft/MMCTAgent.git
cd MMCTAgent
```
System Dependencies
Install FFmpeg
Linux/Ubuntu:
```bash
sudo apt-get update
sudo apt-get install ffmpeg libsm6 libxext6 -y
```
Windows:
- Download FFmpeg from ffmpeg.org
- Add the `bin` folder to your system PATH
Python Environment Setup
Option A: Using Conda (Recommended)
```bash
conda create -n mmct-agent python=3.11
conda activate mmct-agent
```
Option B: Using venv
```bash
python -m venv mmct-agent

# Linux/Mac
source mmct-agent/bin/activate

# Windows
mmct-agent\Scripts\activate.bat
```
Install Dependencies
```bash
pip install --upgrade pip
pip install .
```
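The examples below expect provider credentials at runtime. As an illustrative sketch only — the exact variable names are defined by your provider configuration under `config/`, so treat every key below as an assumption — an environment file might look like:

```bash
# Hypothetical .env sketch -- consult config/ and the Providers Guide for the exact keys
AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"
AZURE_OPENAI_API_KEY="<your-key>"
NEO4J_URI="bolt://localhost:7687"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="<your-password>"
```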
Quick Start Examples
Transform a video file into a structured knowledge graph (Neo4j).
```python
import asyncio

from mmct.video_pipeline import IngestionPipeline, Languages

# Import your provider configuration or use defaults
from config.provider_config import get_ingestion_providers

async def run_ingestion():
    providers = get_ingestion_providers()
    ingestion = IngestionPipeline(
        video_path="path/to/video.mp4",
        video_id="my-unique-video-id",
        language=Languages.ENGLISH_INDIA,
        provider=providers
    )
    report = await ingestion.run()
    print(f"Ingestion {report.status}: {report.summary()}")

asyncio.run(run_ingestion())
```

Extend the ingestion pipeline with your own processing steps. Import `PipelineStep`, `register_step`, and the supporting types from the public API, implement your step class, and reference it in a custom pipeline YAML:
```python
from mmct.video_pipeline import PipelineStep, StepContext, StepResult, register_step

@register_step("myapp.watermark")
class WatermarkStep(PipelineStep):
    step_type = "myapp.watermark"
    description = "Burn a watermark onto extracted frames."

    async def run(self, context: StepContext) -> StepResult:
        # ... your logic here ...
        return StepResult(step_id=self.step_id, outputs={"watermarked": True})
```

Import your step module before running the pipeline so the decorator registers the class, then point `IngestionPipeline` at a YAML config that includes the step. See `scripts/custom_steps/` for a complete working example.
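As a rough sketch of what referencing the step in a custom pipeline YAML might look like — the `steps:` schema shown here is an assumption, not the actual format; see `scripts/custom_steps/` for the real one:

```yaml
# Hypothetical pipeline YAML fragment -- schema is illustrative
steps:
  - type: myapp.watermark
    # step-specific options would go here
```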
Query your video knowledge graph using either agentic swarm or deterministic state modes.
```python
import asyncio

from mmct.video_pipeline.query_pipeline import VideoQueryPipeline, QueryPipelineMode

async def query_video():
    # Initialize the unified pipeline
    # use_provider_defaults=True automatically hydrates providers from config
    pipeline = VideoQueryPipeline(
        mode=QueryPipelineMode.GRAPH_STATE,  # or GRAPH_AGENT for swarm
        use_provider_defaults=True,
        use_critic=True
    )

    # Perform a natural language query
    result = await pipeline.query(
        user_query="What happens after the car enters the frame?",
        video_id="my-unique-video-id"
    )
    print(f"Answer: {result['answer']}")
    await pipeline.close()

asyncio.run(query_video())
```

Analyze individual images using specialized vision tools.
```python
import asyncio

from mmct.image_pipeline.agents.image_agent import ImageAgent, ImageQnaTools

async def query_image():
    agent = ImageAgent(
        image_path="path/to/image.jpg",
        query="What labels are on the medicine bottles?",
        use_critic_agent=True,
        tools=[ImageQnaTools.ocr, ImageQnaTools.vit]
    )
    response = await agent()
    print(f"Analysis: {response}")

asyncio.run(query_image())
```

Expose MMCT pipelines as tools for model-based agents.
```bash
# Start the server
PYTHONPATH=. python3 mcp_server/main.py
```

Access the server at `http://0.0.0.0:8000/mcp`. See the `mcp_server/` directory for more details.
MMCTAgent features a modular provider system that allows you to switch between different cloud providers and AI services seamlessly.
| Service Type | Supported Providers | Use Cases |
|---|---|---|
| LLM | Azure OpenAI + Custom | Reasoning, chat completion, planning |
| Search/DB | Neo4j | Graph traversal, vector search |
| Transcription | Azure Speech, OpenAI Whisper | Audio-to-text conversion |
| Storage | Azure Blob Storage, Local | Media and artifact management |
Note: For detailed implementation, see the Providers Guide.
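As a rough, purely illustrative sketch of what plugging in a custom LLM backend could look like — the class and method names below are assumptions for illustration, not the actual `mmct.providers` interface, which is documented in the Providers Guide:

```python
from abc import ABC, abstractmethod

# Hypothetical provider interface -- names are illustrative, not the real mmct API
class LLMProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Return a completion for the given prompt."""

class EchoProvider(LLMProvider):
    """A stub backend, handy for offline tests of the surrounding pipeline."""
    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt}"

provider = EchoProvider()
print(provider.complete("Summarize the video"))  # [echo] Summarize the video
```

The point of the modular system is exactly this kind of substitution: a stub or self-hosted backend can stand in for Azure OpenAI without touching pipeline code.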
```
MMCTAgent
├── examples/                  # Comprehensive Jupyter notebook examples
├── mcp_server/                # Model Context Protocol server entry points
├── mmct/                      # Core framework
│   ├── image_pipeline/        # Multi-agent image reasoning and vision tools
│   ├── video_pipeline/        # End-to-end video intelligence
│   │   ├── core/              # Ingestion, graph structure, and registry
│   │   ├── graph_agent/       # Swarm-based (agentic) orchestration
│   │   ├── graph_state/       # State-machine (deterministic) orchestration
│   │   └── query_pipeline.py  # Unified video query entry point
│   ├── providers/             # Modular backend service interfaces
│   └── utils/                 # Error handling and shared utilities
├── scripts/                   # Dev scripts and custom step examples
│   └── custom_steps/          # Example custom ingestion steps
├── api/                       # FastAPI web application
├── config/                    # Centralized configuration and hydration
├── infra/                     # Azure infrastructure deployment guides
└── README.md
```

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Note: This project is currently under active research and continuous development. While contributions are encouraged, please note that the codebase may evolve as the project matures.
If you find MMCTAgent useful in your research, please cite:
```bibtex
@article{kumar2024mmctagent,
  title={MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning},
  author={Kumar, Somnath and Gadhia, Yash and Ganu, Tanuja and Nambi, Akshay},
  conference={NeurIPS OWA-2024},
  year={2024},
  url={https://www.microsoft.com/en-us/research/publication/mmctagent-multi-modal-critical-thinking-agent-framework-for-complex-visual-reasoning}
}
```

Licensed under the MIT License.



