Most AI agent projects do not fail because the underlying model is incapable. They fail because no one has a clear picture of what the project actually is — structurally, operationally, or in terms of what it can and cannot do on its own.
The CGM Audit Framework gives a project a level. Knowing the level tells you what questions to ask, what risks are active, and what the next step looks like.
The Framework
Level 0: Unguided
Characteristics: No system prompt. No persistent context. Direct model access via API or chat interface. Each interaction is stateless.
The model is capable. The project is not a project yet — it is a model call. There is no defined behavior, no guardrails, and no way to ensure consistent output. Results vary with phrasing.
Real example: A team using the raw ChatGPT API to generate marketing copy, sending each call without a system prompt. Output quality depended entirely on how the user phrased the individual request.
Audit flag: No system prompt means no contract. The model will attempt to satisfy whatever is in the user message, which is unpredictable at scale.
Level 1: Guided
Characteristics: System prompt defines the model's role, tone, and output format. Context is still stateless, but behavior is consistent within a session.
A Level 1 project has made deliberate choices about what the model should be. It will produce recognizable output. It cannot act on its own, access external systems, or handle multi-step tasks.
Real example: A customer support assistant with a system prompt that defines persona, escalation language, and response format. The model handles single-turn questions. A human handles everything else.
Audit flag: The system prompt is the entire quality layer. If the prompt is vague or under-specified, output quality degrades with no mechanism to catch it.
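The difference between Level 0 and Level 1 is visible in the request payload itself. A minimal sketch, assuming an OpenAI-style chat message format (the prompt text and helper name are illustrative, not from any specific project):

```python
def build_request(user_message, system_prompt=None):
    """Build a chat-style message list. Without a system prompt,
    the user message is the only contract the model sees."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_message})
    return messages

# Level 0: no system prompt -- behavior depends entirely on user phrasing.
level_0 = build_request("Write a tagline for our product.")

# Level 1: the system prompt defines role, tone, and output format.
level_1 = build_request(
    "Write a tagline for our product.",
    system_prompt=(
        "You are a marketing copywriter. Respond with exactly one "
        "tagline, under 10 words, in a confident but plain tone."
    ),
)
```

The Level 1 payload carries the contract on every call; the Level 0 payload carries none.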
Level 2: Tooled
Characteristics: The agent has access to tools — search, APIs, database reads, file operations. It can take actions in external systems. Output is no longer just text; it includes side effects.
This is where most production agent projects live. The capability is real and the risk surface expands significantly. A Level 2 agent can do things, which means it can do the wrong things.
Real example: A research assistant that retrieves documents from a knowledge base, summarizes them, and writes findings to a shared workspace. The agent reads from external systems and writes to them.
Audit flag: Tool permissions need explicit scope. Every tool available to the agent is a potential misuse vector. The audit question is: can this tool be triggered in a context where the side effect is harmful? If yes, add a confirmation step or scope restriction.
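One way to make tool scope explicit is to attach a declared permission and a side-effect flag to every tool, and gate any side-effecting call behind confirmation. A minimal sketch (the tool names and the confirmation convention are illustrative, not a specific framework's API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    scope: str                  # declared permission, e.g. "read" or "write"
    has_side_effects: bool
    fn: Callable[[str], str]

def call_tool(tool: Tool, arg: str, confirmed: bool = False) -> str:
    """Refuse side-effecting calls that lack an explicit confirmation."""
    if tool.has_side_effects and not confirmed:
        return f"BLOCKED: {tool.name} has side effects; confirmation required."
    return tool.fn(arg)

search = Tool("kb_search", scope="read", has_side_effects=False,
              fn=lambda q: f"results for {q!r}")
publish = Tool("workspace_write", scope="write", has_side_effects=True,
               fn=lambda doc: f"published {doc!r}")

print(call_tool(search, "quarterly report"))           # read: runs freely
print(call_tool(publish, "draft.md"))                  # write: blocked
print(call_tool(publish, "draft.md", confirmed=True))  # write: allowed
```

The audit question from above maps directly onto the `has_side_effects` flag: any tool marked true needs either this confirmation step or a narrower scope.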
Level 3: Orchestrated
Characteristics: Multiple agents or subagents operating within a coordinated workflow. Tasks are routed, delegated, and tracked across components. The system has memory — session-level, user-level, or project-level.
A Level 3 project has architecture. There is a primary orchestrator, specialized subagents with defined scopes, and handoff logic between them. Failures can cascade; a subagent error can corrupt the orchestrator's context.
Real example: A document production pipeline where an intake agent parses requirements, a drafting agent generates content, a review agent runs validation passes, and an export agent formats and publishes. Each agent has a defined role and bounded permissions.
Audit flag: Context integrity across handoffs. When the orchestrator passes instructions to a subagent, is the context complete, accurate, and scoped correctly? Context corruption is the primary failure mode at Level 3, and it is hard to detect without explicit logging.
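Context integrity becomes auditable when every handoff goes through an explicit, logged contract instead of free-form text. A minimal sketch (the field names and required keys are illustrative):

```python
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("handoff")

@dataclass
class Handoff:
    from_agent: str
    to_agent: str
    task: str
    context: dict
    required_keys: tuple = ("source_doc", "constraints")

    def missing_keys(self) -> list:
        """Context keys the receiving agent needs but did not get."""
        return [k for k in self.required_keys if k not in self.context]

def dispatch(h: Handoff) -> bool:
    missing = h.missing_keys()
    log.info("handoff %s -> %s task=%r missing=%s",
             h.from_agent, h.to_agent, h.task, missing)
    if missing:
        return False  # reject rather than let a subagent run on partial context
    return True

ok = dispatch(Handoff("intake", "drafting", "draft section 2",
                      {"source_doc": "reqs.md", "constraints": "max 500 words"}))
bad = dispatch(Handoff("drafting", "review", "validate draft", {"draft": "..."}))
```

Rejecting an incomplete handoff at the boundary keeps a subagent error from silently propagating into the orchestrator's context, and the log line makes every handoff reconstructable after the fact.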
Level 4: Self-Improving
Characteristics: The system modifies its own behavior based on outcomes. This includes updating system prompts, adjusting tool configurations, rewriting skill files, or changing routing logic based on performance data.
Level 4 is rare and has the largest risk surface of any level. The system is not just acting — it is changing what it is. Without strict boundaries on what can be modified and under what conditions, a Level 4 system can drift into behavior that no human explicitly designed.
Real example: An agent pipeline that tracks which skill files trigger correctly, identifies underperforming trigger conditions, and rewrites the trigger instructions to improve accuracy. The system's behavior changes over time based on its own operational data.
Audit flag: Self-modification scope must be tightly bounded. Define exactly what can be changed, by what mechanism, under what conditions, and with what human approval checkpoint. An unbounded Level 4 system is not a quality assurance story — it is a liability story.
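The boundary can be enforced in code: an allowlist of modifiable fields and a mandatory human approval flag, checked before any change is applied. A minimal sketch (the field names and approval mechanism are illustrative):

```python
ALLOWED_FIELDS = {"trigger_instructions"}  # the only thing the system may rewrite

def apply_self_modification(config: dict, field_name: str, new_value: str,
                            human_approved: bool) -> dict:
    """Apply a bounded change, returning a new config dict. Raise on any
    change outside the declared scope or without human sign-off."""
    if field_name not in ALLOWED_FIELDS:
        raise PermissionError(f"{field_name!r} is outside modification scope")
    if not human_approved:
        raise PermissionError("self-modification requires human approval")
    updated = dict(config)  # leave the original intact for rollback
    updated[field_name] = new_value
    return updated

config = {"trigger_instructions": "fire on keyword X", "routing": "v1"}
config_v2 = apply_self_modification(
    config, "trigger_instructions", "fire on keywords X or Y",
    human_approved=True)
```

An attempt to rewrite `routing`, or to apply any change without approval, raises instead of silently drifting — which is the property the audit flag asks for.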
Progressing Between Levels
Movement between levels is not automatic. Each transition requires deliberate decisions:
0 to 1: Write a system prompt. Define role, constraints, output format. Test against edge cases before deploying.
1 to 2: Add tools one at a time. For each tool, define: what it can access, what it cannot access, what a bad tool call looks like, and how the system recovers from one.
2 to 3: Design the orchestration layer before building it. Define agent boundaries, handoff contracts, and failure handling. Add logging at every handoff point from day one.
3 to 4: Only cross this threshold with explicit human review checkpoints on every self-modification. Define a rollback mechanism. Treat any system prompt or skill file change as a deployment event.
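The rollback requirement in the 3 to 4 transition can be made concrete by versioning prompt and skill-file changes the way you would version a deploy. A minimal sketch (the class and method names are illustrative):

```python
class PromptStore:
    """Version every system prompt change like a deployment event:
    keep full history, allow one-step rollback."""

    def __init__(self, initial: str):
        self.history = [initial]

    @property
    def current(self) -> str:
        return self.history[-1]

    def deploy(self, new_prompt: str) -> int:
        """Record a new version; return its version number."""
        self.history.append(new_prompt)
        return len(self.history) - 1

    def rollback(self) -> str:
        """Revert to the previous version; the initial prompt is never lost."""
        if len(self.history) > 1:
            self.history.pop()
        return self.current

store = PromptStore("v1: summarize in plain language")
store.deploy("v2: summarize with citations")
store.rollback()  # bad deploy? back to the previous version immediately
```

Because every version is retained, a self-modification that degrades behavior can be unwound instead of argued about.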
Using the Framework
The value of knowing a project's level is not the label. It is the questions the label forces.
A Level 2 project without a tool permission audit is a project with unknown risk surface. A Level 3 project without handoff logging is a project where failures are invisible until they are catastrophic. A Level 4 project without rollback is a project that cannot be unwound.
Run the audit. Know the level. Address the gaps that level implies before moving to the next one.
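The audit itself reduces to a handful of yes/no questions, applied in order. A minimal sketch of the framework as a classifier (the flag names are illustrative; each level presupposes the ones below it):

```python
def cgm_level(has_system_prompt: bool, has_tools: bool,
              has_orchestration: bool, self_modifies: bool) -> int:
    """Map a project's observable characteristics to its CGM level."""
    if self_modifies:
        return 4
    if has_orchestration:
        return 3
    if has_tools:
        return 2
    if has_system_prompt:
        return 1
    return 0

# A support assistant with a system prompt but no tools is Level 1:
assert cgm_level(True, False, False, False) == 1
# A research assistant that reads and writes external systems is Level 2:
assert cgm_level(True, True, False, False) == 2
```

The answer is less important than what it triggers: each level maps to the audit flag and the transition checklist above.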