The Framework That Proved Multi-Agent Works
AutoGen earned its position. It was one of the first frameworks to show that multiple AI agents collaborating on a task could produce meaningfully better results than a single agent working alone. 54K GitHub stars. 40% of the Fortune 100 tried it. Published at ICLR 2024. The multi-agent concept went from academic curiosity to something developers were actually building with.
Then Microsoft deprecated it.
AutoGen is now in maintenance mode — bug fixes only, no new features. The active development moved to the Microsoft Agent Framework, which merges AutoGen with Semantic Kernel and ties the enterprise path to Azure. The original creators forked it into AG2 under independent governance. What was one framework is now four separate projects with four different futures.
That fragmentation isn't the main story. It's a symptom. The real question is why the organization that built one of the most popular AI agent frameworks decided the architecture wasn't right for production — and what that tells you about the difference between a research framework and a production platform.
What AutoGen Is
AutoGen is a Python framework (also .NET) for building multi-agent systems. The v0.4 rewrite introduced a layered architecture:
AutoGen Core — an Actor model for event-driven agent communication. Agents interact through asynchronous messages. The runtime supports both local development and distributed deployment without architectural changes.
AutoGen AgentChat — the high-level API most developers use. The core concept is teams — groups of agents organized into conversation patterns:
- RoundRobinGroupChat: Agents take turns in a fixed order. Good for structured collaboration — a writer drafts, a reviewer critiques, the writer revises.
- SelectorGroupChat: An LLM dynamically picks which agent speaks next based on the conversation so far. More flexible, less predictable.
- Swarm: Agents hand off to each other explicitly. Gives developers direct control over conversation routing.
- MagenticOneGroupChat: An Orchestrator agent coordinates specialized workers through adaptive planning. The most ambitious pattern.
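These turn-taking patterns reduce to a simple control loop. Here is a toy sketch of round-robin turn-taking, deliberately independent of AutoGen's real API; the agent signatures and names are illustrative only:

```python
from typing import Callable, List

# Toy model of the round-robin pattern: agents take turns appending to a
# shared transcript in a fixed cyclic order until a turn cap is reached.
Agent = Callable[[List[str]], str]  # reads the transcript, returns a message

def round_robin(agents: List[Agent], task: str, max_turns: int = 6) -> List[str]:
    transcript = [task]
    for turn in range(max_turns):
        agent = agents[turn % len(agents)]  # order fixed by list position
        transcript.append(agent(transcript))
    return transcript

# A "writer" drafts, a "reviewer" critiques, alternating.
writer = lambda t: f"draft v{len(t)}"
reviewer = lambda t: f"critique of {t[-1]}"
log = round_robin([writer, reviewer], "Write a summary", max_turns=4)
```

The point is the control structure: who speaks next is fixed by list order, and nothing the agents say changes the routing. Selector and Swarm replace that one line with an LLM decision or an explicit handoff.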
AutoGen Studio — a visual interface for prototyping agent teams with drag-and-drop configuration, an interactive playground, and a component gallery.
The architecture is genuinely innovative. The team patterns gave developers multiple ways to structure agent collaboration. Model support is broad — OpenAI, Azure OpenAI, and other providers. MIT license. Built-in code execution in Docker, local environments, or Azure Container Apps. A memory protocol with ChromaDB, Redis, and Mem0 backends.
AutoGen validated that multi-agent systems aren't a research curiosity. That contribution is real.
The Fragmentation Problem
If you're evaluating AutoGen today, you need to understand what you're evaluating.
AutoGen has split into four projects:
AutoGen (microsoft/autogen) — maintenance mode. Bug fixes and security patches. No new features. This is the 54K-star repository. It's not where active development happens.
AG2 (ag2ai/ag2) — a community fork by original AutoGen creators under Apache 2.0 license. Independent governance, active development. A bet on the community continuing what Microsoft moved on from.
Microsoft Agent Framework — the official successor, merging AutoGen with Semantic Kernel. Targeting 1.0 GA by end of Q1 2026. Enterprise features built in — but tied to the Azure ecosystem.
Azure AI Foundry Agent Service — the managed cloud runtime. This is the production deployment path Microsoft endorses.
Building on AutoGen today means choosing between a framework in maintenance mode, a community fork with an uncertain long-term trajectory, or Microsoft's new framework that requires Azure.
This isn't unprecedented — open-source projects fork and evolve. But the reason for the fragmentation matters. Microsoft didn't abandon AutoGen because adoption faltered; they abandoned the architecture. The conversational multi-agent model proved insufficient for enterprise production. Their own migration documentation says it directly. Their engineering teams saw the gap and built something different.
The Conversational Architecture Problem
AutoGen's fundamental abstraction is a conversation between agents. You create a team, give it a task, and agents take turns responding. In RoundRobin, they cycle through a fixed order. In Selector, an LLM picks who speaks next. The conversation continues until a termination condition triggers — someone says a specific word, a maximum turn count is reached, or an external signal fires.
This looks impressive in demos. You watch agents debating approaches, catching each other's mistakes, building on ideas. The output is often genuinely good for open-ended creative and research tasks.
In production, the problems become structural.
Non-Determinism
SelectorGroupChat uses an LLM to decide which agent speaks next. That decision is itself non-deterministic. The same question, with the same agents, can produce different conversation paths — different agents contributing in different orders, different intermediate results, different final outputs. The same input on Monday and Tuesday can generate different answers and different costs.
Production systems need predictability. When the CFO asks why the Q4 cost analysis came back different today than last week, "the agents had a different conversation" isn't an acceptable answer.
Thallus uses dependency DAGs — directed acyclic graphs with explicit step ordering. The planner creates a structured execution plan where each step has defined inputs, outputs, and dependencies. Independent steps run in parallel. The execution path is determined by the plan structure, not by which agent happens to speak next.
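A dependency plan like that can be pictured with Python's stdlib `graphlib`. This is a minimal sketch with hypothetical step names, not Thallus's actual plan schema:

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each step maps to the set of steps it depends on.
plan = {
    "query_gl": set(),                     # no dependencies
    "query_po": set(),                     # independent: runs in parallel
    "search_contracts": {"query_gl"},      # needs GL results first
    "synthesize": {"query_po", "search_contracts"},
}

ts = TopologicalSorter(plan)
ts.prepare()                               # raises CycleError on a cyclic plan
batches = []
while ts.is_active():
    ready = sorted(ts.get_ready())         # all currently unblocked steps
    batches.append(ready)                  # this batch can run in parallel
    ts.done(*ready)
```

The execution path falls out of the plan structure: the two queries run together, contracts wait for the ledger, synthesis waits for both. Run it Monday or Tuesday, the batches are the same.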
Infinite Loops
AutoGen's most notorious production failure: agents that never stop talking.
Termination conditions exist — TextMention (stop when someone says "TERMINATE"), MaxMessageTermination, HandoffTermination. But they're fragile. If agents don't produce the expected termination signal, the conversation runs indefinitely. LLM costs spiral. The system hangs.
This isn't a bug — it's an architectural property. Open-ended conversations have no inherent stopping point. Termination is a condition you configure and hope agents trigger. When they don't, there's no structural mechanism to prevent runaway execution.
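A toy model (not AutoGen's API) shows why keyword termination is a hope rather than a guarantee:

```python
# An open-ended conversation loop only stops if some agent happens to emit
# the termination keyword, or if a message cap was configured as a backstop.
def run_conversation(agents, task, max_messages=None):
    messages = [task]
    i = 0
    while True:
        reply = agents[i % len(agents)](messages)
        messages.append(reply)
        if "TERMINATE" in reply:                 # TextMention-style condition
            return messages, "terminated"
        if max_messages and len(messages) >= max_messages:
            return messages, "max_messages"      # safety net, if configured
        i += 1

# An agent that never produces the keyword: only the cap saves us.
chatty = lambda msgs: "let me reconsider..."
log, reason = run_conversation([chatty], "task", max_messages=10)
```

Remove the `max_messages` argument and this loop never returns. That is the structural risk: the stopping condition lives in model output, not in the architecture.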
Thallus makes unbounded execution architecturally impossible through multiple layered safeguards:
DAG architecture: Execution plans are directed acyclic graphs. Cycles cannot exist by definition. Step A depends on Step B, Step B depends on Step C — but Step C can never circle back to depend on Step A. The structure itself prevents infinite loops.
Agent max_iterations: Every agent has a hard cap on its tool-use loop. The agent completes its task or hits the limit and returns what it has. No agent runs indefinitely.
Multi-level timeouts: Timeouts at the workflow level, the branch level, and the individual node level. If any component exceeds its time allocation, it's terminated. No single step can hang the system.
Workflow loop safety caps: Even intentional loop nodes are bounded — maximum iteration count and minimum delay between iterations.
Auto-disable on consecutive failures: Scheduled workflows that fail repeatedly are automatically disabled. A broken integration doesn't hammer a dead service until someone notices.
Bounded evaluation loop: After each execution batch, the planner evaluates: complete, replan, or ask the user. Three outcomes — not "keep the conversation going."
AutoGen relies on developers configuring termination conditions and hopes agents trigger them. Thallus guarantees bounded execution through the architecture itself.
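Two of those safeguards fit in a few lines. This is an illustrative sketch, not Thallus's real implementation:

```python
from graphlib import TopologicalSorter, CycleError

# 1. A cyclic plan is rejected before anything runs.
def validate_plan(plan: dict) -> bool:
    try:
        TopologicalSorter(plan).prepare()  # raises CycleError on a cycle
        return True
    except CycleError:
        return False

# 2. An agent's tool-use loop has a hard iteration cap.
def agent_loop(step_fn, max_iterations: int = 5):
    result = None
    for _ in range(max_iterations):        # can never exceed the cap
        result, done = step_fn(result)
        if done:
            break
    return result                          # returns what it has so far

ok = validate_plan({"a": {"b"}, "b": {"c"}, "c": set()})
bad = validate_plan({"a": {"b"}, "b": {"a"}})        # cycle: rejected
never_done = lambda r: ((r or 0) + 1, False)         # a step that never finishes
capped = agent_loop(never_done, max_iterations=5)    # stops at 5 regardless
```

The cyclic plan fails validation before execution, and the agent that never signals completion still terminates. Neither outcome depends on the model behaving well.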
Unpredictable Cost
A conversation that resolves in 3 agent turns one day might take 12 the next, depending on model behavior, intermediate results, and conversation dynamics. When the Selector picks a different agent order, the conversation takes a different path with different token counts.
This makes cost forecasting nearly impossible. When the finance team asks "what will this cost per month?", the honest answer for AutoGen is "it depends on what the agents say to each other." For Thallus, the answer is based on plan steps and agent execution — structured, bounded, measurable.
Microsoft's Own Assessment
Microsoft's documentation acknowledges that conversational multi-agent systems "excel at research demos but lack the governance, observability, and determinism enterprises need for production deployments." This isn't a competitor's critique. It's the assessment from the organization that built the framework, explaining why they're replacing it.
The Data Intelligence Gap
No Database Intelligence
AutoGen has no built-in database system. No connection manager, no schema discovery, no query executor. You can write Python functions that query databases and pass them to agents as tools — community examples exist for text-to-SQL patterns. But the entire data layer is DIY.
What's missing:
- Automatic schema discovery — Thallus scans tables, columns, types, foreign keys, and implicit join paths when a database is connected. AutoGen agents know nothing about your database structure unless a developer hardcodes it into a tool.
- Relationship inference — Thallus detects foreign keys and implicit relationships across tables so agents generate multi-table queries automatically. AutoGen requires manual schema documentation.
- PII detection — Thallus scans seven categories of personally identifiable information when a database is connected and flags sensitive columns for review. AutoGen has no concept of sensitive data.
- Column-level access controls — Thallus makes restricted columns invisible in the schema agents receive — the model never sees fields it shouldn't query. AutoGen has no data access controls.
- Read-only enforcement with query validation — Thallus parses every agent-generated SQL query against 30+ forbidden patterns before it reaches the database. AutoGen has no query validation layer.
Connect a database to Thallus and the system understands it. Connect a database to AutoGen and you're writing the understanding yourself.
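A minimal version of that read-only validation layer might look like this. Thallus checks 30+ patterns; these few are only illustrative:

```python
import re

# Patterns that disqualify an agent-generated query (illustrative subset).
FORBIDDEN = [
    r"\bINSERT\b", r"\bUPDATE\b", r"\bDELETE\b", r"\bDROP\b",
    r"\bALTER\b", r"\bTRUNCATE\b", r"\bGRANT\b",
    r";\s*\S",   # stacked statements after a semicolon
]

def validate_readonly(sql: str) -> bool:
    """Reject any query that is not a single plain SELECT."""
    if not re.match(r"\s*SELECT\b", sql, re.IGNORECASE):
        return False
    return not any(re.search(p, sql, re.IGNORECASE) for p in FORBIDDEN)
```

The check runs before the query reaches the database, so a prompt-injected `DROP TABLE` dies in the validator, not in production data.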
No Document Pipeline
AutoGen v0.4 has a Memory protocol with vector search — ChromaDB, Redis, and Mem0 backends. Documents can be chunked, embedded, and searched via semantic similarity. This is more than most frameworks offer.
What it doesn't have:
- Two-stage search: Thallus uses synopsis discovery to identify relevant documents, then chunk retrieval within those documents for precision. AutoGen does flat vector similarity across all chunks.
- Document catalog: Thallus builds a structured catalog of document synopses with weighted tsvector, injected into planner context at three tiers of detail. AutoGen has no document-level understanding — only chunks.
- Cross-source entity linking: Thallus maps entities between documents and database columns, enabling agents to connect "the vendor mentioned in section 3 of the contract" with "vendor_id in the purchase_orders table." AutoGen has no cross-source linking.
- Citation tracking: Thallus traces every claim in the synthesis back to its source document section. AutoGen's memory retrieval returns chunks without provenance tracking.
- Cross-model embedding compatibility: Thallus zero-pads embeddings to 3072 dimensions for compatibility across OpenAI, Azure, Gemini, and Ollama providers. Switching embedding providers in AutoGen means re-embedding everything.
AutoGen's RAG is basic vector similarity search. Thallus's is a structured intelligence pipeline with citations.
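The zero-padding trick itself is simple. A sketch, with dimension numbers illustrative:

```python
# Pad embeddings from narrower models out to one common width so vectors
# from different providers can live in a single index.
TARGET_DIM = 3072

def pad_embedding(vec: list, target: int = TARGET_DIM) -> list:
    if len(vec) > target:
        raise ValueError(f"embedding wider than target: {len(vec)}")
    return vec + [0.0] * (target - len(vec))

padded = pad_embedding([0.1] * 768)   # e.g. a 768-dim local model
```

Appending zeros leaves dot products between vectors from the same provider unchanged, so existing similarity rankings survive the widening and nothing needs re-embedding when a new provider is added.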
The Governance Gap
Zero in the Framework
AutoGen is a Python library. Libraries don't have user management.
There's no RBAC. No audit trails. No multi-tenancy. No SSO. No approval gates. No secrets management. No data access controls. When an agent queries a database, it sees whatever the connection credentials allow. When an agent takes an action, nothing stops it from proceeding. When an auditor asks what the AI accessed last Tuesday, there's no log to show them.
This isn't a criticism of AutoGen's design goals — it was built as a research framework, and research frameworks don't need compliance infrastructure. The problem is that organizations evaluate AutoGen for production use, discover the governance gap, and face an uncomfortable choice.
The Azure Trap
The governance features enterprises need exist — but only in the Microsoft ecosystem:
- RBAC via Microsoft Entra ID
- Audit trails via Azure Monitor and OpenTelemetry
- Multi-tenancy with per-tenant isolation
- Content safety via Azure AI Content Safety
- Compliance via Microsoft Purview (50+ certifications including SOC 2, FedRAMP, ISO 27001)
These aren't AutoGen features. They're Azure features. To get enterprise governance, you leave the open-source framework and lock into Microsoft's cloud ecosystem. Your "open-source" AI agent strategy ends with an Azure dependency.
For organizations already all-in on Azure, this may be acceptable. For everyone else — especially those with multi-cloud strategies, self-hosting requirements, or a preference for vendor independence — it means the open-source evaluation was a detour. The production path was always Azure.
AutoGen Studio: "Not Meant for Production"
Microsoft built AutoGen Studio as a visual interface for prototyping agent teams. It's useful for experimentation — drag-and-drop team configuration, interactive playground, live message streaming, cost profiling.
Microsoft's documentation explicitly states it is "not meant to be used in a production environment." It lacks security testing for jailbreaking prevention, has no granular data access controls, and should not be exposed to untrusted users.
The framework's creator is telling you its own visual tool isn't production-ready. The prototyping experience is good. The production path goes through Azure or custom development.
Thallus: Governance Without Lock-In
Thallus includes governance at every tier — not as an enterprise add-on, not behind a cloud dependency:
4-tier RBAC (Platform → Organization → Group → User) controlling which agents each user invokes, which tools those agents use, and which database tables and columns those tools query. The marketing team uses the document search agent but not the database query agent. The finance team queries the financial database but can't see the SSN column. Each layer enforced in application code, not prompt instructions.
Immutable audit trails capturing every tool call, every query, every agent decision — with the agent's reasoning alongside each action. Sensitive parameters automatically redacted. Structured, tamper-evident, compliance-ready.
Approval gates as a native workflow node with multi-approval flows, configurable timeouts, escalation paths, full context presentation, and audit integration. The workflow physically cannot proceed without human authorization.
Code-level tool confirmations where write operations require human confirmation, enforced by prefix matching outside the model's execution loop. Prompt injection can't bypass it.
Self-hosted or SaaS — same features either way. Air-gap capable with Ollama for local models. No cloud vendor required.
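The prefix-matching confirmation check is worth seeing in miniature. Prefixes and names here are illustrative, not Thallus's actual configuration:

```python
from typing import Callable

# Write operations are gated by tool-name prefix. The check runs in the
# host application AFTER the model has chosen a tool, so no prompt can
# talk the model out of it.
WRITE_PREFIXES = ("create_", "update_", "delete_", "send_")

def execute_tool(tool_name: str, args: dict, confirm: Callable[..., bool]) -> dict:
    if tool_name.startswith(WRITE_PREFIXES):
        if not confirm(tool_name, args):         # human says no: hard stop
            return {"status": "rejected", "tool": tool_name}
    return {"status": "executed", "tool": tool_name}

deny_all = lambda name, args: False
out = execute_tool("delete_invoice", {"id": 42}, confirm=deny_all)
```

Because the gate lives in application code rather than in the prompt, a jailbroken model can request a write but can never perform one without the confirmation returning true.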
The User Access Gap
No Ad Hoc Questions
AutoGen requires a developer to define agent teams, configure tools, and write code before anyone can ask a question. Want to analyze vendor spend? Someone writes a team with the right agents and tools. Want to compare contract terms? Someone writes another team.
In Thallus, you type: "Compare our vendor spend against contract terms for renewals due next quarter." The planner decomposes it, identifies data sources, assigns agents, maps dependencies, and executes. No one predefined the workflow. No one wrote Python. The planning happens at query time, not at development time.
This isn't a convenience feature. It's a fundamentally different capability. AutoGen answers questions a developer anticipated and built a team for. Thallus answers questions nobody anticipated.
No Non-Developer Access
AutoGen is a Python library. Every interaction requires code. Every new agent team, every new analysis capability, every adjustment to conversation patterns — developer task.
Thallus provides three paths that don't require code:
Chat interface — type a question, get a cited analysis. No configuration needed beyond connecting data sources.
Natural language workflow creation — describe what you want in English ("Every weekday at 9 AM, research market trends for our top 3 competitors, check if any significant changes are found, and if so send a summary to the #market-intel Slack channel"). The AI generates the complete DAG with trigger, action nodes, conditions, and delivery. Review, adjust visually, activate.
Visual DAG editor — 9 node types, drag-and-drop, side-panel configuration with natural language instructions. Agent auto-suggestion based on task description.
Domain experts — analysts, operations managers, compliance officers — build and modify workflows without filing a ticket with engineering.
The Workflow Gap
AutoGen recently added GraphFlow — a DAG-based workflow system where nodes are agents and edges define execution paths. It supports sequential chains, parallel fan-out/fan-in, conditional branching, and loops.
It's experimental. Still has open bugs — conditional edges not being followed, nodes executing twice. And it's missing the infrastructure that makes workflows production-ready:
- No scheduling — no cron triggers, no poll-based triggers. You build scheduling externally.
- No approval gates — HandoffTermination can pause for human input, but there's no formal approval workflow with timeouts, escalation, or persistent state.
- No versioning — no version history, no rollback, no breaking change detection.
- No delivery — no built-in email, Slack, webhook, or document store output.
- No crash recovery — no persistent workflow state for resuming after failures.
- No success evaluation — nodes complete when agents finish, with no assessment of whether the result actually meets the objective.
Thallus workflow engine: 9 node types (trigger, action, condition, router, merge, loop, subworkflow, approval, delivery), Celery Beat scheduling, Redis-backed crash recovery, AI-evaluated success criteria, workflow versioning with breaking change detection, auto-disable on failures, multi-level timeouts, natural language creation. GraphFlow is a workflow primitive. This is a workflow engine.
The Reasoning Gap
Conversations vs. Investigations
AutoGen agents take turns in a conversation. The conversation is the reasoning process — agents respond to each other's messages, building on or correcting what came before.
Thallus agents execute steps in a dependency DAG. Each agent has a specific task assigned by the planner, receives targeted context from the shared board, and posts structured findings back when done.
The difference: in AutoGen, the reasoning path emerges from conversation dynamics — which agent speaks when, what they choose to say. In Thallus, the reasoning path is planned, executed, evaluated, and adapted.
No Dynamic Planning
AutoGen team patterns are chosen at development time: RoundRobin, Selector, Swarm, or MagenticOne. The pattern is fixed. What changes between runs is what the agents say, not how the orchestration works.
Thallus plans at query time. "Why did our Q4 operating costs spike 15% versus forecast?" generates a plan that queries the general ledger, the purchase order system, and the HR database, then searches contract documents and vendor performance reports. The planner identifies all of this from the question and the available data. An AutoGen developer would need to have anticipated this specific combination and built a team for it.
No Adaptive Re-Planning
AutoGen conversations flow until termination. There's no evaluation step that asks "based on what we've learned so far, should we change the approach?"
In Thallus, the evaluation loop after each execution batch decides: is the plan complete? Does it need additional steps? Should it be restructured? If an agent discovers an unexpected $2M vendor charge while querying the GL, the evaluator adds steps to investigate that vendor — querying procurement, searching for the contract, checking the PO system. None of these steps were in the original plan. They emerged from the investigation.
This is what turns orchestration from "run these agents" into "investigate this question until you have a sufficient answer."
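The shape of that loop, with a stub standing in for the AI evaluator, looks roughly like this (a toy sketch; in Thallus the decision is model-driven):

```python
# Three outcomes per round ("complete" | "replan" | "ask_user"),
# plus a hard round cap so even replanning is bounded.
def run_plan(execute_batch, evaluate, max_rounds: int = 5):
    findings = []
    for _ in range(max_rounds):
        findings.extend(execute_batch(findings))
        decision = evaluate(findings)
        if decision in ("complete", "ask_user"):
            return decision, findings
        # "replan": loop again with new steps derived from findings
    return "ask_user", findings        # cap reached: escalate to the user

# A run that needs one replan before the evaluator is satisfied:
batch = lambda f: [f"finding_{len(f)}"]
evaluate = lambda f: "complete" if len(f) >= 2 else "replan"
decision, findings = run_plan(batch, evaluate)
```

Note the contrast with a conversation loop: even when every round says "replan," the cap forces a hand-back to the user instead of indefinite execution.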
No Shared Structured Context
AutoGen agents share context through conversation messages. Each agent sees previous messages — unstructured text. With GraphFlow's MessageFilterAgent, you can restrict what each agent sees, but the context is still conversation history.
Thallus uses a Redis-backed board with structured data: table schemas, document catalogs, cross-source entity links, agent findings. Each agent receives only the relevant board data for its specific task. The database query agent gets schema information. The document search agent gets the document catalog. The synthesis agent gets structured findings from previous steps.
A database agent discovers that vendor spend is concentrated in three categories. The document search agent, seeing this on the board, focuses its contract search on those three categories instead of searching broadly. The synthesis agent correlates spend data with contract terms and identifies that one category is 40% over its contract ceiling. This cross-pollination of context requires a shared structured space — not conversation messages.
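An in-memory stand-in for the board makes the scoping concrete (keys, schemas, and step names are illustrative):

```python
# Dict stand-in for the Redis-backed board: structured entries, and each
# agent reads only the scopes assigned to its step.
board = {
    "schema": {"purchase_orders": ["vendor_id", "amount", "category"]},
    "doc_catalog": ["msa_acme.pdf", "sow_globex.pdf"],
    "findings": {},
}

def scoped_view(board: dict, scopes: list) -> dict:
    return {k: board[k] for k in scopes if k in board}

def post_finding(board: dict, step: str, finding) -> None:
    board["findings"][step] = finding

# The database agent sees schemas and prior findings, not the doc catalog:
db_context = scoped_view(board, ["schema", "findings"])
post_finding(board, "query_spend", {"top_categories": ["IT", "Legal", "Ops"]})
```

The document agent would then read `findings` from its own scoped view and narrow its contract search to those three categories, which is the cross-pollination described above, carried by structured data rather than chat transcripts.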
Three Reasoning Modes
AutoGen applies the same developer-chosen team pattern to every question, whatever its shape.
Thallus auto-detects the appropriate reasoning mode:
ASK — single-agent response for straightforward questions. Fast, no planning overhead.
RESEARCH — planner-directed DAG execution with parallel steps, dependency management, and evaluation loops. For analytical questions spanning data sources.
INVESTIGATE — supervisor-driven reactive investigation pursuing multiple hypotheses based on emerging findings. For complex, open-ended questions where the investigation path can't be planned upfront.
"What's our revenue this quarter?" routes to ASK. "Compare vendor spend against contract terms across regions" routes to RESEARCH. "Why did customer churn spike last month and what should we do about it?" routes to INVESTIGATE. The system matches the strategy to the question. The user doesn't choose. The system detects.
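For illustration only, a keyword heuristic captures the three-way split; Thallus's actual detection is model-based, not rule-based:

```python
# Toy router: crude textual signals standing in for AI mode detection.
def detect_mode(question: str) -> str:
    q = question.lower()
    if q.startswith("why") or "root cause" in q:
        return "INVESTIGATE"      # open-ended causal questions
    if any(w in q for w in ("compare", "across", "versus", "trend")):
        return "RESEARCH"         # analytical, multi-source questions
    return "ASK"                  # direct lookups
```

The heuristic is deliberately naive; the point is only that routing happens per question, so a simple lookup never pays the planning overhead of a full investigation.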
The Comparison
| Capability | AutoGen | Thallus |
|---|---|---|
| Ad hoc questions | Requires predefined teams in code | Ask anything — dynamic planning at query time |
| Execution model | Agent conversations (non-deterministic, infinite loop risk) | Dependency DAGs (bounded, structured, loops impossible) |
| Dynamic planning | Fixed conversation patterns | AI planner creates DAGs, re-plans based on results |
| Context architecture | Conversation messages (unstructured) | Shared board (structured, scoped per agent) |
| User interface | Python only (Studio not for production) | Chat + visual DAG editor + natural language workflows |
| Database intelligence | Build your own tools | Native schema discovery, PII detection, column-level controls |
| Document RAG | Basic vector search (ChromaDB/Redis) | Two-stage semantic search, citations, cross-source linking |
| RBAC | None (Azure for enterprise) | 4-tier at agent, tool, and data level |
| Audit trails | Basic logging | Immutable with agent reasoning, PII redaction |
| Approval gates | None | Native with multi-approval, timeouts, escalation |
| Workflow engine | GraphFlow (experimental) | Full DAG: 9 node types, scheduling, delivery, versioning, crash recovery |
| Self-hosted | Library (no platform) | Full platform via Docker, air-gap capable |
| Enterprise path | Azure AI Foundry (vendor lock-in) | Self-hosted or SaaS — no lock-in |
| Pricing | Free (MIT) — enterprise requires Azure | From $30/mo — self-hosted Enterprise unlimited |
When You Need What
Use AutoGen when:
- You're doing academic research on multi-agent conversation patterns
- You want to prototype conversational agent interactions quickly
- The use case is experimental and won't go to production on its own
- You have Azure infrastructure and plan to migrate to the Microsoft Agent Framework
- You're comfortable building on maintenance-mode software
Use Thallus when:
- Users need to ask ad hoc questions across data sources without predefined agent teams
- You need bounded, predictable execution — not conversations that might loop indefinitely
- Enterprise governance is non-negotiable — RBAC, audit trails, approval gates, data-level access controls
- Non-technical users need to build and run AI workflows
- You need native database intelligence and document search with citations
- Self-hosting and vendor independence matter
- You need production workflows with scheduling, delivery, versioning, and crash recovery
- You'd rather deploy a production platform than build one on a deprecated research framework
The Bottom Line
AutoGen earned its place in the history of AI agent frameworks. It showed that multi-agent collaboration produces better results than single-agent approaches. It attracted massive adoption and validated an entire category. The research contribution is real and lasting.
But the architecture that made those demos compelling — open-ended conversations between agents — is the same architecture that made production deployment unreliable. Non-deterministic execution paths. Infinite loop risks. No governance hooks. No data intelligence. No way for the compliance team to audit what happened, or for the operations manager to build a workflow without writing Python.
Microsoft saw this and built something else. The question for your organization isn't whether AutoGen was a good research framework — it was. The question is whether a research framework in maintenance mode, with an enterprise path that leads exclusively through Azure, is the right foundation for production AI agents.
The alternative is a platform built for production from day one: bounded execution that makes infinite loops structurally impossible, governed data access, structured investigation architecture, adaptive reasoning, and a user experience that works for the whole organization — not just the developer who wrote the team definition. Self-hosted or SaaS, no cloud vendor required, governance at every tier.