Demo Video • Research Paper • Quick Start
Note: This demo video is from the use case with DigitalGreen.org where MMCT is being used. The app code shown in the video is not available in this repository.
MMCTAgent is a state-of-the-art multi-modal AI framework that brings human-like critical thinking to visual reasoning tasks. It combines advanced planning, self-critique, and tool-based reasoning to deliver superior performance in complex image and video understanding applications.
- Self-Reflection Framework: MMCTAgent iteratively analyzes multi-modal information, decomposes complex queries, plans strategies, and dynamically evolves its reasoning. Designed as a research framework, it integrates critical-thinking elements such as final-answer verification and self-reflection through a novel approach that defines a vision-based critic and identifies task-specific evaluation criteria, thereby enhancing its decision-making abilities.
- Querying over Multimodal Collections: Its modular design lets you plug in the right audio and visual extraction and processing tools, combined with multimodal LLMs, to ingest and query large collections of video and image data.
- Easy Integration: The modular design allows easy integration into existing workflows and the addition of domain-specific tools, facilitating adoption across domains that require advanced visual reasoning capabilities.
- Dual Orchestration for Video:
  - Graph Agent (Swarm): An autonomous multi-agent team (Planner, Video, Image) that agentically traverses knowledge graphs for broad discovery and reasoning.
  - Graph State (Pipeline): A deterministic state-machine workflow designed for high-precision, reproducible, and efficient retrieval.
- Agentic Image Reasoning: A dedicated multi-agent image pipeline featuring specialized tools for OCR, object detection, scene recognition, and ViT-based reasoning.
- Structured Video Ingestion: An end-to-end pipeline that transforms raw video into a queryable temporal knowledge graph (Neo4j), with automated scene segmentation and metadata extraction.
- Provider Agnostic: Native support for Azure OpenAI, Neo4j, and Azure Blob Storage, with a modular interface for custom LLM and VectorDB backends.
- MCP Native: Built-in Model Context Protocol (MCP) server support for seamless integration with external AI clients.
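The plan–critique loop behind the self-reflection framework can be sketched in miniature. This is a purely illustrative toy with made-up names, not the real agents in `mmct/`: a critic checks an answer against task-specific evaluation criteria, and the agent revises until the critic is satisfied.

```python
def critic(answer: str, criteria: list[str]) -> list[str]:
    """Return the evaluation criteria the answer fails to satisfy (toy check)."""
    return [c for c in criteria if c not in answer]

def agent_loop(draft: str, criteria: list[str], max_rounds: int = 3) -> str:
    """Iteratively revise a draft answer until the critic raises no failures."""
    answer = draft
    for _ in range(max_rounds):
        failures = critic(answer, criteria)
        if not failures:
            break
        # Self-reflection step: revise the answer to address each failed criterion
        answer += " " + " ".join(failures)
    return answer

print(agent_loop("The bottle label reads", ["dosage", "expiry"]))
```

In MMCTAgent proper, the critic is vision-based and the criteria are identified per task; the toy above only shows the control flow.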
To get started with MMCTAgent, clone the repository and install the dependencies:

```bash
git clone https://github.com/microsoft/MMCTAgent.git
cd MMCTAgent
```
System Dependencies
Install FFmpeg
Linux/Ubuntu:
```bash
sudo apt-get update
sudo apt-get install ffmpeg libsm6 libxext6 -y
```
Windows:
- Download FFmpeg from ffmpeg.org
- Add the `bin` folder to your system PATH
Python Environment Setup
Option A: Using Conda (Recommended)
```bash
conda create -n mmct-agent python=3.11
conda activate mmct-agent
```
Option B: Using venv
```bash
python -m venv mmct-agent

# Linux/Mac
source mmct-agent/bin/activate

# Windows
mmct-agent\Scripts\activate.bat
```
Install Dependencies
```bash
pip install --upgrade pip
pip install .
```
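The examples below expect provider credentials at runtime. As an illustrative sketch only — the exact variable names are defined by your provider configuration under `config/`, so treat every key below as an assumption — an environment file might look like:

```bash
# Hypothetical .env sketch -- consult config/ and the Providers Guide for the exact keys
AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"
AZURE_OPENAI_API_KEY="<your-key>"
NEO4J_URI="bolt://localhost:7687"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="<your-password>"
```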
Quick Start Examples
Transform a video file into a structured knowledge graph (Neo4j).
```python
import asyncio

from mmct.video_pipeline import IngestionPipeline, Languages

# Import your provider configuration or use defaults
from config.provider_config import get_ingestion_providers

async def run_ingestion():
    providers = get_ingestion_providers()
    ingestion = IngestionPipeline(
        video_path="path/to/video.mp4",
        video_id="my-unique-video-id",
        language=Languages.ENGLISH_INDIA,
        provider=providers
    )
    report = await ingestion.run()
    print(f"Ingestion {report.status}: {report.summary()}")

asyncio.run(run_ingestion())
```

Extend the ingestion pipeline with your own processing steps. Import `PipelineStep`, `register_step`, and the supporting types from the public API, implement your step class, and reference it in a custom pipeline YAML:
```python
from mmct.video_pipeline import PipelineStep, StepContext, StepResult, register_step

@register_step("myapp.watermark")
class WatermarkStep(PipelineStep):
    step_type = "myapp.watermark"
    description = "Burn a watermark onto extracted frames."

    async def run(self, context: StepContext) -> StepResult:
        # ... your logic here ...
        return StepResult(step_id=self.step_id, outputs={"watermarked": True})
```

Import your step module before running the pipeline so the decorator registers the class, then point `IngestionPipeline` at a YAML config that includes the step. See `scripts/custom_steps/` for a complete working example.
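As a rough sketch of what referencing the step in a custom pipeline YAML might look like — the `steps:` schema shown here is an assumption, not the actual format; see `scripts/custom_steps/` for the real one:

```yaml
# Hypothetical pipeline YAML fragment -- schema is illustrative
steps:
  - type: myapp.watermark
    # step-specific options would go here
```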
Query your video knowledge graph using either agentic swarm or deterministic state modes.
```python
import asyncio

from mmct.video_pipeline.query_pipeline import VideoQueryPipeline, QueryPipelineMode

async def query_video():
    # Initialize the unified pipeline
    # use_provider_defaults=True automatically hydrates providers from config
    pipeline = VideoQueryPipeline(
        mode=QueryPipelineMode.GRAPH_STATE,  # or GRAPH_AGENT for swarm
        use_provider_defaults=True,
        use_critic=True
    )

    # Perform a natural language query
    result = await pipeline.query(
        user_query="What happens after the car enters the frame?",
        video_id="my-unique-video-id"
    )
    print(f"Answer: {result['answer']}")
    await pipeline.close()

asyncio.run(query_video())
```

Analyze individual images using specialized vision tools.
```python
import asyncio

from mmct.image_pipeline.agents.image_agent import ImageAgent, ImageQnaTools

async def query_image():
    agent = ImageAgent(
        image_path="path/to/image.jpg",
        query="What labels are on the medicine bottles?",
        use_critic_agent=True,
        tools=[ImageQnaTools.ocr, ImageQnaTools.vit]
    )
    response = await agent()
    print(f"Analysis: {response}")

asyncio.run(query_image())
```

Expose MMCT pipelines as tools for model-based agents.
```bash
# Start the server
PYTHONPATH=. python3 mcp_server/main.py
```

Access the server at `http://0.0.0.0:8000/mcp`. See the `mcp_server/` directory for more details.
MMCTAgent features a modular provider system that allows you to switch between different cloud providers and AI services seamlessly.
| Service Type | Supported Providers | Use Cases |
|---|---|---|
| LLM | Azure OpenAI + Custom | Reasoning, chat completion, planning |
| Search/DB | Neo4j | Graph traversal, vector search |
| Transcription | Azure Speech, OpenAI Whisper | Audio-to-text conversion |
| Storage | Azure Blob Storage, Local | Media and artifact management |
Note: For detailed implementation, see the Providers Guide.
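As a rough, purely illustrative sketch of what plugging in a custom LLM backend could look like — the class and method names below are assumptions for illustration, not the actual `mmct.providers` interface, which is documented in the Providers Guide:

```python
from abc import ABC, abstractmethod

# Hypothetical provider interface -- names are illustrative, not the real mmct API
class LLMProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Return a completion for the given prompt."""

class EchoProvider(LLMProvider):
    """A stub backend, handy for offline tests of the surrounding pipeline."""
    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt}"

provider = EchoProvider()
print(provider.complete("Summarize the video"))  # [echo] Summarize the video
```

The point of the modular system is exactly this kind of substitution: a stub or self-hosted backend can stand in for Azure OpenAI without touching pipeline code.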
```
MMCTAgent
├── examples/                  # Comprehensive Jupyter notebook examples
├── mcp_server/                # Model Context Protocol server entry points
├── mmct/                      # Core framework
│   ├── image_pipeline/        # Multi-agent image reasoning and vision tools
│   ├── video_pipeline/        # End-to-end video intelligence
│   │   ├── core/              # Ingestion, graph structure, and registry
│   │   ├── graph_agent/       # Swarm-based (agentic) orchestration
│   │   ├── graph_state/       # State-machine (deterministic) orchestration
│   │   └── query_pipeline.py  # Unified video query entry point
│   ├── providers/             # Modular backend service interfaces
│   └── utils/                 # Error handling and shared utilities
├── scripts/                   # Dev scripts and custom step examples
│   └── custom_steps/          # Example custom ingestion steps
├── api/                       # FastAPI web application
├── config/                    # Centralized configuration and hydration
├── infra/                     # Azure infrastructure deployment guides
└── README.md
```

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Note: This project is currently under active research and continuous development. While contributions are encouraged, please note that the codebase may evolve as the project matures.
If you find MMCTAgent useful in your research, please cite:
```bibtex
@article{kumar2024mmctagent,
  title={MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning},
  author={Kumar, Somnath and Gadhia, Yash and Ganu, Tanuja and Nambi, Akshay},
  conference={NeurIPS OWA-2024},
  year={2024},
  url={https://www.microsoft.com/en-us/research/publication/mmctagent-multi-modal-critical-thinking-agent-framework-for-complex-visual-reasoning}
}
```

Licensed under the MIT License.



