Outcome Governance Benchmark · March 2026
Twelve AI agent frameworks. Three leading model families. Eight hundred twenty-eight scored decisions. Every governed framework graded A or B. Every ungoverned framework graded F. The causal information architecture determined the outcome, not the model.
The setup
Identical scenarios across three LLM families, with and without OSR governance. Six scoring criteria, two hundred ten points total. The setup was designed to isolate what actually drives outcome quality.
Three leading model families, each run through every framework. The same prompts. The same scenarios. The same scoring rubric.
Single-agent, multi-agent, orchestrated. LangChain, CrewAI, AutoGen, and more. Six with OSR + CPP governance, six ungoverned.
Two multi-year simulation scenarios covering demand shocks, capacity constraints, and financial masking. Eight hundred twenty-eight scored decisions.
Key findings
Five detection patterns appeared consistently in the governed cohort. None appeared unprompted in the ungoverned cohort across any model family.
Surface revenue metrics appeared healthy while underlying operational stocks collapsed. Governed frameworks detected the masking via the SC-05 criterion. No ungoverned framework surfaced it unprompted across any of the three model families.
Ungoverned agents optimized the visible revenue line and inverted the underlying outcome. Cash flowed in while the system that produced the cash was eroding. Governance caught the decoupling; pure target-following did not.
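The decoupling pattern can be illustrated with a toy stock-and-flow sketch. Everything below is an assumption made for illustration: the dynamics, the numbers, and the divergence check are invented here and are not the benchmark's actual simulation or its OSR/SC-05 scoring logic.

```python
# Toy illustration of the revenue-masking pattern described above.
# All dynamics and thresholds are invented for this sketch; they are
# not the benchmark's actual simulation or governance implementation.

def simulate(months=24, target_revenue=15.0, erosion=0.6):
    """An agent holds revenue at target by harvesting an unreplenished stock."""
    capacity = 100.0
    revenue, capacities = [], []
    for _ in range(months):
        earned = min(target_revenue, capacity)  # hit the target while possible
        capacity -= earned * erosion            # extraction erodes the stock
        revenue.append(earned)
        capacities.append(capacity)
    return revenue, capacities

def masking_detected(revenue, capacities, window=6, tol=0.05):
    """Flag when revenue looks healthy while the underlying stock collapses."""
    rev_trend = revenue[-1] / revenue[-window] - 1.0
    cap_trend = capacities[-1] / capacities[-window] - 1.0
    return rev_trend > -tol and cap_trend < -tol
```

In this sketch the first ten months show a perfectly flat revenue line while capacity falls from 100 toward 10. A trend-only observer sees health; the stock-aware check flags the divergence, which is the shape of the decoupling described above.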
Governed frameworks tolerated short-term apparent stagnation to protect long-term system health. Ungoverned frameworks chased short-term motion signals and collapsed by month 24. The governed curves were slower to move and far more durable.
When asked to self-assess, ungoverned frameworks produced plausible-sounding health narratives that contradicted the ground truth in the simulation. Governance made the contradiction visible. Without it, the narrative won.
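A minimal sketch of how a narrative-versus-ground-truth contradiction could be surfaced mechanically. The direction-of-trend comparison below is a hypothetical stand-in, not the study's actual governance check; the metric names, labels, and tolerance are all invented.

```python
# Hypothetical sketch: compare an agent's claimed narrative about each
# metric against the direction the ground-truth series actually moved.
# Metric names, the tolerance, and the three-way labels are invented.

def direction(series, tol=0.05):
    """Classify a series as 'up', 'down', or 'stable' over its full span."""
    change = series[-1] / series[0] - 1.0
    if change > tol:
        return "up"
    if change < -tol:
        return "down"
    return "stable"

def contradictions(claims, ground_truth):
    """Return metrics where the narrative claim disagrees with the data."""
    return {
        metric: (claims[metric], direction(series))
        for metric, series in ground_truth.items()
        if metric in claims and claims[metric] != direction(series)
    }
```

For example, a claim that capacity is "stable" against a series that fell from 100 to 40 is returned as a contradiction, while an accurate "up" claim about rising revenue is not.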
CrewAI and LangChain received identical OSR specs. Under governance, the difference in outcome quality between them was negligible; without it, enormous. The orchestration framework mattered less than the presence of governance.
Download the artifacts
The benchmark is published as three separate artifacts so readers can pull the perspective they need. All three are ungated and free.

The headline result. Twelve frameworks, three model families, complete tier separation between governed and ungoverned cohorts.

Scoring across all six criteria, framework by framework. The visual proof of where governance moved the curve and where it didn't.

The complete narrative. Methodology, scorecards, and the five named detection patterns from the March 2026 study.
The full report
The full March 2026 report covers the complete methodology, all twelve framework scorecards, per-criterion breakdowns across the three model families, and the full set of detection patterns.