Platform AI Part 3: Building a Multi-Agent System for DevOps

January 18, 2026

This is Part 3 of the Platform AI series. Part 1 covers building the basic chatbot. Part 2 adds agentic tool use for AKS management.

The single-agent approach from Part 2 worked fine for straightforward requests. Ask it to check cluster status, it calls the right tool. Ask it to stop a cluster, it confirms and executes. But as I started using it for real troubleshooting sessions, the cracks showed up fast.

A question like "why is my pod crashing?" requires a different mindset than "start the hub cluster." The first needs investigation, hypothesis forming, log analysis. The second needs careful execution with pre-flight checks and confirmation. Cramming both behaviors into a single system prompt creates a confused agent that either over-investigates simple operations or rushes through debugging without proper analysis.

The solution was obvious once I saw the pattern. Instead of one agent trying to be everything, build a team of specialists and let them collaborate.

Live: https://chrishouse.io/tools/ai


The Multi-Agent Architecture

The new system routes requests to one of five specialized agents, each optimized for a specific type of work.

┌─────────────────┐     ┌─────────────────────┐     ┌──────────────────┐
│   User Message  │────>│  Agent Orchestrator │────>│  Selected Agent  │
└─────────────────┘     └─────────────────────┘     └──────────────────┘
                               │                           │
                               │ Intent Classification     │ Specialized
                               │ Confidence Scoring        │ System Prompt
                               │ Handoff Management        │ Filtered Tools
                               v                           v
                        ┌─────────────────────────────────────────────┐
                        │              Agent Types                    │
                        ├─────────────────────────────────────────────â”Ș
                        │  Triage    - Status checks, overview        │
                        │  Debugger  - Investigation, root cause      │
                        │  Operator  - Safe mutations, verification   │
                        │  Advisor   - Best practices, recommendations│
                        │  IaC       - Terraform, Crossplane, GitOps  │
                        └─────────────────────────────────────────────┘

Each agent has its own system prompt, a filtered set of tools, and rules for when to hand off to another specialist. The orchestrator sits in front, classifying intent and routing requests to the right agent.


The Five Agents

Triage Agent

Triage handles quick status checks and overview requests. When someone asks "what's running?" or "show me the pods," they probably want a fast answer, not a deep investigation. Triage gathers the requested data and presents it clearly.

Triage has read-only tools like get_cluster_status, get_pods, get_deployments, and get_events. It cannot execute mutations. If someone asks Triage to restart a deployment, it recognizes that requires the Operator agent and initiates a handoff.
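As a hedged sketch (the verb list and return shape here are illustrative, not the exact production code), Triage's handoff check can be as simple as scanning for mutation verbs it has no tools for:

```javascript
// Hypothetical sketch: Triage detects mutation verbs it cannot execute
// and suggests a handoff to the Operator agent.
const MUTATION_VERBS = /\b(start|stop|restart|scale|delete|rollout|apply)\b/i;

function triageSuggestHandoff(message) {
  if (MUTATION_VERBS.test(message)) {
    return {
      shouldHandoff: true,
      targetAgent: 'operator',
      reason: 'Request requires a mutation, which Triage cannot execute'
    };
  }
  return null; // Triage can answer read-only requests itself
}

triageSuggestHandoff('restart the api deployment'); // handoff to 'operator'
triageSuggestHandoff('show me the pods');           // null, Triage handles it
```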

Debugger Agent

Debugger is the investigator. When something is broken and you need to understand why, Debugger formulates hypotheses and tests them systematically. It looks at pod descriptions, pulls logs from current and previous containers, checks events, and correlates findings.

The Debugger system prompt explicitly instructs it to think in terms of root cause analysis. It gathers context first, forms a hypothesis about what might be wrong, tests that hypothesis with targeted tool calls, and presents findings with confidence levels.
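One illustrative way to represent those findings with confidence levels (the field names are assumptions for this sketch, not the production schema):

```javascript
// Hypothetical findings structure for the Debugger's diagnosis step.
// A hypothesis contributes to the diagnosis only if evidence confirmed it.
function buildDiagnosis(findings) {
  const confirmed = findings.filter(f => f.confirmed);
  return {
    rootCause: confirmed[0]?.hypothesis || 'unknown',
    confidence: confirmed.length > 0 ? 'high' : 'low',
    evidence: confirmed.flatMap(f => f.evidence)
  };
}

const diagnosis = buildDiagnosis([
  { hypothesis: 'Container OOMKilled: memory limit too low', confirmed: true,
    evidence: ['lastState: OOMKilled', 'restartCount: 14'] },
  { hypothesis: 'Bad image tag', confirmed: false, evidence: [] }
]);
// diagnosis.confidence === 'high'; the unconfirmed hypothesis is dropped
```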

Operator Agent

Operator handles mutations. Starting clusters, stopping clusters, deleting pods, restarting deployments. These actions have real consequences, so Operator is built around safety. Every mutation requires user confirmation. Operator runs pre-flight checks before actions and verification checks after.

The confirmation flow is baked into the agent itself. When Operator decides to call stop_cluster, it first verifies the cluster is actually running, warns about impacts like "12 pods will be terminated," and waits for explicit user approval before executing.

Advisor Agent

Advisor provides recommendations and best practices. When someone asks "should I use a DaemonSet or Deployment for this?" or "what's the best way to handle secrets?", Advisor draws on Kubernetes patterns and platform engineering principles to give guidance.

Advisor has no mutation tools at all. It can query current state for context but cannot make changes. Its role is purely consultative.

IaC Agent

IaC handles Infrastructure as Code questions. Terraform modules, Crossplane compositions, Helm charts, GitOps workflows. It knows how to read HCL syntax, understands Crossplane XRDs, and can help with ArgoCD application definitions.

IaC also integrates with a RAG system that indexes the actual repository. When someone asks about a specific Terraform module, IaC can search the codebase and reference real files rather than hallucinating generic examples.
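A sketch of that lookup, under assumptions: searchCodebase is a hypothetical helper over the repository index, and the hit shape is illustrative.

```javascript
// Hypothetical sketch of the IaC agent's RAG lookup. searchCodebase is an
// assumed helper that queries the repository index; the real interface
// may differ.
async function answerIaCQuestion(question, searchCodebase) {
  const hits = await searchCodebase(question, { topK: 3 });
  return {
    // Ground the answer in real files instead of generic examples
    context: hits.map(h => `// ${h.path}\n${h.snippet}`).join('\n\n'),
    sources: hits.map(h => h.path)
  };
}

// Stub index for demonstration
const stubSearch = async () => [
  { path: 'modules/aks/main.tf',
    snippet: 'resource "azurerm_kubernetes_cluster" "this" { ... }' }
];
```

The returned context gets injected into the IaC agent's prompt, with sources surfaced so the user can verify which files the answer came from.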


Agent Selection

The orchestrator classifies each incoming message and selects the most appropriate agent. This happens through pattern matching and confidence scoring.

// Intent patterns for routing
const INTENT_PATTERNS = {
  debug: [
    /why.*(fail|crash|error|not working|stuck)/i,
    /debug/i,
    /troubleshoot/i,
    /investigate/i,
    /what.*(wrong|issue|problem)/i,
    /crashloop/i,
    /diagnose/i,
    /root cause/i,
    /analyze.*(logs?|error|issue)/i,
  ],

  operate: [
    /\b(start|stop|restart|scale|delete|rollout)\b/i,
    /\b(deploy|upgrade|rollback)\b/i,
    /apply.*changes?/i,
    /execute/i,
  ],

  advise: [
    /best practice/i,
    /recommend/i,
    /should I/i,
    /how should/i,
    /what.*(approach|strategy)/i,
    /optimize/i,
  ],

  triage: [
    /status/i,
    /overview/i,
    /health/i,
    /what.*(running|deployed|happening)/i,
    /show.*all/i,
    /list/i,
    /get.*pods?/i,
    /check/i,
  ]
};

Each pattern match adds to the score for that intent. The highest scoring intent determines which agent handles the request.

export function classifyIntent(message) {
  const scores = {};

  for (const [intent, patterns] of Object.entries(INTENT_PATTERNS)) {
    scores[intent] = 0;
    for (const pattern of patterns) {
      if (pattern.test(message)) {
        scores[intent]++;
      }
    }
  }

  // Find highest scoring intent
  let maxIntent = 'triage'; // Default to triage
  let maxScore = 0;

  for (const [intent, score] of Object.entries(scores)) {
    if (score > maxScore) {
      maxScore = score;
      maxIntent = intent;
    }
  }

  // Map intent to agent
  const agentMap = {
    debug: 'debugger',
    operate: 'operator',
    advise: 'advisor',
    triage: 'triage'
  };

  const confidence = maxScore > 0 ? Math.min(maxScore / 3, 1) : 0.3;

  return {
    intent: maxIntent,
    confidence,
    suggestedAgent: agentMap[maxIntent],
    scores
  };
}

A message like "why is my pod crashing and restarting?" hits the combined why/crash pattern, and additional cues like "debug", "diagnose", or "root cause" stack further matches, pushing the score and confidence toward the Debugger agent.
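To make the scoring concrete, here is a self-contained run over a subset of the debug patterns reproduced from above:

```javascript
// Subset of the debug patterns, scored the same way classifyIntent does:
// each matching pattern adds one to the intent's score.
const DEBUG_PATTERNS = [
  /why.*(fail|crash|error|not working|stuck)/i,
  /crashloop/i,
  /root cause/i
];

function scoreDebug(message) {
  return DEBUG_PATTERNS.filter(p => p.test(message)).length;
}

scoreDebug('why is my pod crashing and restarting?');    // 1 match
scoreDebug('pod is in crashloop, find the root cause');  // 2 matches
```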


The Agent Orchestrator

The orchestrator is the traffic controller. It maintains state about which agent is currently active, handles agent selection, and manages handoffs between agents.

export class AgentOrchestrator {
  constructor(tools, options = {}) {
    this.tools = tools;
    this.agents = {};

    // Initialize all agents
    for (const [name, AgentClass] of Object.entries(AGENT_TYPES)) {
      this.agents[name] = new AgentClass(tools, options);
    }

    this.currentAgent = null;
    this.handoffHistory = [];
    this.onAgentSwitch = options.onAgentSwitch || null;
  }

  selectAgent(message, context = {}) {
    const scores = [];

    for (const [name, agent] of Object.entries(this.agents)) {
      const assessment = agent.canHandle(message, context);
      scores.push({
        agent: name,
        confidence: assessment.confidence,
        canHandle: assessment.canHandle,
        reason: assessment.reason
      });
    }

    // Sort by confidence
    scores.sort((a, b) => b.confidence - a.confidence);
    const best = scores[0];

    // If coming from handoff, respect the suggested agent
    if (context.handoff?.targetAgent && this.agents[context.handoff.targetAgent]) {
      return {
        agent: context.handoff.targetAgent,
        confidence: 0.9,
        reason: `Handoff from ${context.handoff.fromAgent}: ${context.handoff.reason}`,
        alternatives: scores.filter(s => s.agent !== context.handoff.targetAgent).slice(0, 2)
      };
    }

    return {
      agent: best.agent,
      confidence: best.confidence,
      reason: best.reason,
      alternatives: scores.slice(1, 3)
    };
  }
}

Each agent implements a canHandle method that returns a confidence score. The Debugger agent, for example, looks for investigation-related keywords and returns high confidence when it sees patterns indicating a troubleshooting session.

// From debugger-agent.js
canHandle(message, context = {}) {
  let score = 0;

  for (const pattern of DEBUGGER_PATTERNS) {
    if (pattern.test(message)) {
      score += 0.25;
    }
  }

  // Coming from handoff with issue to investigate
  if (context.handoff?.findings?.length > 0) {
    score += 0.4;
  }

  // Explicit debug keywords
  if (/\b(debug|investigate|root cause|diagnose)\b/i.test(message)) {
    score += 0.3;
  }

  const confidence = Math.min(1, score);

  return {
    canHandle: confidence > 0.3,
    confidence,
    reason: confidence > 0.5 ? 'Request indicates investigation needed' : 'May need debugging'
  };
}

Agent Handoffs

Sometimes an agent realizes mid-conversation that another specialist would be better suited. The Triage agent might discover a crashing pod while checking status and hand off to Debugger. The Debugger might identify a fix and hand off to Operator to execute it.

Handoffs preserve context. When Debugger hands off to Operator, it passes along what it discovered, what tools it already called, and what action it recommends. The receiving agent gets this context injected into its system prompt so it can continue intelligently rather than starting from scratch.

handleHandoff(fromAgent, toAgent, reason, context = {}) {
  const sourceAgent = this.agents[fromAgent];
  const targetAgent = this.agents[toAgent];

  // Build handoff context
  const handoffContext = {
    fromAgent,
    toAgent,
    reason,
    timestamp: Date.now(),
    ...sourceAgent.buildHandoffContext(
      context.workDone || [],
      context.findings || [],
      context.toolResults || []
    )
  };

  // Record handoff
  this.handoffHistory.push(handoffContext);

  // Switch to new agent
  this.currentAgent = toAgent;

  if (this.onAgentSwitch) {
    this.onAgentSwitch({
      agent: toAgent,
      confidence: 0.9,
      reason: `Handoff: ${reason}`,
      handoff: true,
      from: fromAgent
    });
  }

  return handoffContext;
}

Each agent has rules about when to suggest a handoff. The Operator agent, for example, hands off to Debugger if an action fails and needs investigation.

// From operator-agent.js
suggestHandoff(message, toolResults = [], context = {}) {
  // If action failed, suggest debugger
  const failedActions = toolResults.filter(r =>
    !r.success || r.data?.error || r.error
  );

  if (failedActions.length > 0) {
    return {
      shouldHandoff: true,
      targetAgent: 'debugger',
      reason: `Action failed: ${failedActions[0].error}. Debugger can investigate.`,
      context: { failedActions }
    };
  }

  // If user asks for explanation after action
  if (/why|should|best\s*practice|recommend/i.test(message)) {
    return {
      shouldHandoff: true,
      targetAgent: 'advisor',
      reason: 'User asking for recommendations/best practices',
      context: { completedActions: toolResults.filter(r => r.success) }
    };
  }

  return null;
}

Investigation Workflows

The Debugger agent is where multi-agent really shines. A simple "why is my pod crashing?" triggers a structured investigation.

The agent first gathers context. It calls describe_pod to get the pod specification, conditions, and container statuses. It pulls current logs with get_pod_logs and previous container logs to catch errors that happened before the last restart. It checks recent events for warnings like OOMKilled or ImagePullBackOff.

All of this happens in a single reasoning loop. The agent decides which tools to call, executes them, analyzes the results, and either continues investigating or presents its findings.
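That loop can be sketched as follows. callModel and executeTool are assumed helpers here, and the iteration cap mirrors the maxIterations idea the agents use; this is a shape, not the production implementation.

```javascript
// Hypothetical agent reasoning loop: ask the model, execute any requested
// tools, feed results back, stop when it answers or the cap is hit.
async function runAgentLoop(messages, { callModel, executeTool, maxIterations = 5 }) {
  for (let i = 0; i < maxIterations; i++) {
    const reply = await callModel(messages);
    if (!reply.toolCalls?.length) return reply.text; // final answer

    for (const call of reply.toolCalls) {
      const result = await executeTool(call.name, call.args);
      messages.push({ role: 'tool', name: call.name, content: result });
    }
  }
  return 'Investigation hit the iteration limit; presenting partial findings.';
}
```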

// Debugger agent system prompt (excerpt)
`You are the Debugger Agent for Platform AI. Your role is to:

1. **Gather Context**: Get pod status, logs, events, resource usage
2. **Form Hypotheses**: Based on symptoms, identify possible causes
3. **Test Hypotheses**: Use targeted queries to confirm or rule out causes
4. **Present Findings**: Explain root cause with evidence

## Investigation Process

When investigating an issue:
1. Start with describe_pod to understand current state
2. Pull logs (current + previous) for error messages
3. Check events for warnings (OOMKilled, ImagePullBackOff, etc.)
4. Correlate findings to identify root cause
5. Present diagnosis with confidence level

## When to Suggest Handoff
- **→ Operator**: If you've identified a fix (restart, scale, delete)
- **→ Advisor**: If user asks for best practices to prevent recurrence
- **→ Triage**: If issue resolved and user wants status overview`

The result is an agent that thinks like a platform engineer. It does not just dump logs and hope the user figures it out. It correlates evidence, identifies patterns, and explains what went wrong and why.


The Base Agent Pattern

All five agents extend a common base class that provides shared functionality. This keeps the architecture consistent and makes adding new agents straightforward.

export class BaseAgent {
  constructor(options = {}) {
    if (new.target === BaseAgent) {
      throw new Error('BaseAgent is abstract and cannot be instantiated directly');
    }

    this.name = options.name || 'BaseAgent';
    this.description = options.description || '';
    this.capability = options.capability || AgentCapability.READ_ONLY;
    this.priority = options.priority || AgentPriority.MEDIUM;
    this.tools = options.tools || [];
    this.maxIterations = options.maxIterations || 5;
    this.icon = options.icon || 'đŸ€–';
    this.color = options.color || '#6366f1';
  }

  // Must be implemented by subclasses
  buildSystemPrompt(context = {}) {
    throw new Error('buildSystemPrompt must be implemented by subclass');
  }

  canHandle(message, context = {}) {
    throw new Error('canHandle must be implemented by subclass');
  }

  // Optional overrides
  suggestHandoff(message, toolResults = [], context = {}) {
    return null;
  }

  preprocessMessage(message, context = {}) {
    return message;
  }

  postprocessResponse(response, toolResults = [], context = {}) {
    return response;
  }
}

The capability system controls what agents can do. READ_ONLY agents like Triage and Advisor cannot execute mutations. EXECUTE agents like Operator require confirmation for destructive actions. This is enforced at the tool level, so even if a prompt injection tried to get Advisor to delete a pod, the tool simply would not be available.

export const AgentCapability = {
  READ_ONLY: 'read_only',       // Can only query/read
  SUGGEST: 'suggest',           // Can suggest actions
  EXECUTE: 'execute',           // Can execute with confirmation
  AUTONOMOUS: 'autonomous'      // Can execute without confirmation (rare)
};
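Enforcement at tool binding time might look like this sketch (the mutates flag on tool metadata is an assumption about how tools are tagged):

```javascript
// Hypothetical tool binding: agents below EXECUTE never see mutation tools,
// so a prompt injection has nothing dangerous to call.
function bindTools(allTools, capability) {
  const canMutate = capability === 'execute' || capability === 'autonomous';
  return allTools.filter(t => canMutate || !t.mutates);
}

const tools = [
  { name: 'get_pods', mutates: false },
  { name: 'delete_pod', mutates: true }
];
bindTools(tools, 'read_only').map(t => t.name); // ['get_pods']
bindTools(tools, 'execute').map(t => t.name);   // ['get_pods', 'delete_pod']
```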

Confirmation Flow for Mutations

The Operator agent has a specific confirmation flow for dangerous operations. Every mutation goes through the same pattern: verify current state, present impact, wait for explicit confirmation, execute, verify result.

// Actions that require confirmation
const CONFIRMATION_REQUIRED = {
  start_cluster: { level: 'high', message: 'Starting a cluster will incur compute costs' },
  stop_cluster: { level: 'high', message: 'Stopping will make the cluster unavailable' },
  delete_pod: { level: 'medium', message: 'Pod will be recreated by its controller' },
  restart_deployment: { level: 'medium', message: 'Will trigger rolling restart of all pods' },
  scale_nodepool: { level: 'high', message: 'Will change the number of VMs (affects cost)' }
};

// Pre-flight checks for each action
const PREFLIGHT_CHECKS = {
  start_cluster: ['get_cluster_status'],
  stop_cluster: ['get_cluster_status', 'get_pods'],
  delete_pod: ['describe_pod'],
  restart_deployment: ['get_deployments', 'get_pods'],
  scale_nodepool: ['list_nodepools', 'get_pods']
};

When Operator decides to stop a cluster, it first calls get_cluster_status and get_pods as pre-flight checks. If the cluster is already stopped, it reports that and does not ask for confirmation. If pods are running, it warns the user how many will be terminated.

getConfirmationDetails(toolName, args = {}) {
  const config = CONFIRMATION_REQUIRED[toolName];
  if (!config) return null;

  // Fall back to the raw tool name; scale_nodepool has no custom label below
  let description = toolName;

  switch (toolName) {
    case 'start_cluster':
      description = `Start cluster "${args.cluster_name}"`;
      break;
    case 'stop_cluster':
      description = `Stop cluster "${args.cluster_name}"`;
      break;
    case 'delete_pod':
      description = `Delete pod "${args.pod_name}" in ${args.namespace}`;
      break;
    case 'restart_deployment':
      description = `Restart deployment "${args.deployment_name}" in ${args.namespace}`;
      break;
  }

  return {
    action: toolName,
    description,
    level: config.level,
    warning: config.message,
    args
  };
}
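The pre-flight step itself reduces to a small runner over that table. This is a sketch: executeTool is an assumed helper, and the check table is reproduced in miniature.

```javascript
// Hypothetical pre-flight runner: execute each read-only check for an
// action so Operator can warn accurately before asking for confirmation.
const PREFLIGHT_CHECKS = {
  stop_cluster: ['get_cluster_status', 'get_pods']
};

async function runPreflight(action, args, executeTool) {
  const results = {};
  for (const check of PREFLIGHT_CHECKS[action] || []) {
    results[check] = await executeTool(check, args);
  }
  return results;
}

// Stub executor for demonstration
const stubExec = async (tool) =>
  tool === 'get_pods' ? { count: 12 } : { state: 'Running' };
```

With these results in hand, Operator can say "12 pods will be terminated" rather than guessing.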

Real-Time Feedback with SSE

The frontend shows which agent is active and when switches happen. This is handled through Server-Sent Events that the orchestrator emits on state changes.

export function createOrchestratorWithSSE(tools, emitEvent, options = {}) {
  return new AgentOrchestrator(tools, {
    ...options,
    onAgentSwitch: (selection) => {
      emitEvent('agent_switch', {
        agent: selection.agent,
        reason: selection.reason,
        confidence: selection.confidence,
        from: selection.from,
        handoff: selection.handoff || false,
        timestamp: Date.now()
      });
    }
  });
}

The React frontend listens for these events and updates the UI. Users see a small indicator showing which agent is handling their request. When a handoff happens, they see it transition smoothly from Debugger to Operator, for example.

This transparency builds trust. Users understand that when they ask an investigation question, they get the investigation specialist. When they ask to execute something, they get the operations specialist.
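On the client, handling the payload reduces to parsing the event data. The /api/chat/stream endpoint in the comment below is an assumption; the fields match the emitEvent call above.

```javascript
// Hypothetical client-side parser for the agent_switch SSE payload.
function parseAgentSwitch(eventData) {
  const { agent, from, handoff = false, reason } = JSON.parse(eventData);
  return { agent, from, handoff, reason };
}

// In the browser (assumed endpoint and React wiring):
//   const src = new EventSource('/api/chat/stream');
//   src.addEventListener('agent_switch',
//     e => setActiveAgent(parseAgentSwitch(e.data)));

parseAgentSwitch(
  '{"agent":"operator","from":"debugger","handoff":true,"reason":"Handoff: fix identified"}'
);
```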


Security Layers

Before any message reaches an agent, it passes through security filters. The system blocks requests for secrets, credentials, and sensitive data. It also catches off-topic requests before they waste model tokens.

const BANNED_PATTERNS = [
  /\bsecret[s]?\b/i,
  /\bpassword[s]?\b/i,
  /\bcredential[s]?\b/i,
  /\bapi[_-]?key[s]?\b/i,
  /\bkubeconfig\b/i,
  /kubectl\s+get\s+secret/i,
  /kubectl\s+describe\s+secret/i,
];

function containsSensitiveRequest(text) {
  return BANNED_PATTERNS.some(pattern => pattern.test(text));
}

Off-topic detection prevents people from using the assistant for general chat, math homework, or prompt injection attempts.

const OFF_TOPIC_PATTERNS = [
  // Math
  /^\s*\d+\s*[\+\-\*\/xĂ—Ă·]\s*\d+/i,
  /calculate|compute|solve.*\d/i,

  // Jailbreak attempts
  /ignore\s*(your|previous|all)\s*(instructions?|rules?)/i,
  /pretend\s*(you('re)?|to\s*be)/i,
  /you\s*are\s*now\s*(a|an|no\s*longer)/i,

  // General knowledge
  /what\s*(is|are)\s*(the\s*)?(sun|moon|stars?|planets?)/i,
  /write\s*(me\s*)?(a\s*)?(poem|story|essay)/i,
];

const OFF_TOPIC_REDIRECT = "I'm Platform AI - I only help with Kubernetes and infrastructure. Try asking about cluster status, deployments, pods, or Azure resources!";

These filters run before the orchestrator even sees the message. A request for secrets gets rejected immediately with a clear explanation of the security policy. An off-topic request gets a polite redirect without consuming any Claude API tokens.
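Put together, the gate in front of the orchestrator is one short function. This sketch reuses the pattern lists above in miniature; the reply strings are illustrative.

```javascript
// Hypothetical pre-filter combining both checks before any model call.
const BANNED = [/\bsecrets?\b/i, /\bpasswords?\b/i, /\bcredentials?\b/i];
const OFF_TOPIC = [/write\s*(me\s*)?(a\s*)?(poem|story|essay)/i];

function filterMessage(text) {
  if (BANNED.some(p => p.test(text))) {
    return { allowed: false, reply: 'I cannot help with secrets or credentials.' };
  }
  if (OFF_TOPIC.some(p => p.test(text))) {
    return { allowed: false,
      reply: "I'm Platform AI - I only help with Kubernetes and infrastructure." };
  }
  return { allowed: true }; // safe to hand to the orchestrator
}

filterMessage('show me the secrets'); // blocked, no tokens spent
filterMessage('get pods in default'); // allowed through
```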


Technical Decisions

Why five agents instead of three or ten?

Five covers the major interaction patterns I observed in my own usage. Status checks (Triage), troubleshooting (Debugger), operations (Operator), guidance (Advisor), and infrastructure code (IaC). More agents would mean more routing complexity without clear benefit. Fewer agents would mean cramming multiple personas into single prompts.

Why pattern-based routing instead of LLM classification?

Speed and cost. Pattern matching happens in microseconds without an API call. LLM classification would add latency and token cost to every request. The pattern approach handles 90% of cases correctly, and for edge cases, each agent's canHandle method provides a secondary check.

Why hand off context explicitly instead of sharing conversation history?

Token efficiency. A full conversation history grows quickly and most of it is not relevant to the receiving agent. By passing structured handoff context, the receiving agent gets what it needs to continue without paying for irrelevant earlier messages.

Why capability levels on agents?

Defense in depth. Even if prompt injection somehow bypassed the security filters and convinced an agent to try deleting a pod, READ_ONLY agents simply do not have the tool available. The capability level is enforced at tool binding time, not at prompt time.


What's Next

The handoff system could become smarter with learned routing. Instead of static patterns, the system could learn from successful interactions which agent combinations work best for which types of requests.

And there is still the state machine and meta-tool system to cover, but that is another article.


Conclusion

Going from a single chatbot to a multi-agent system changed how I think about AI assistants. The single-agent approach forces you to write one prompt that handles everything. That works for simple cases but creates confused behavior for complex workflows.

The multi-agent approach lets each specialist excel at its job. Triage is fast and focused. Debugger is methodical and thorough. Operator is safe and verified. They hand off to each other when needed, preserving context and building on each other's work.

The implementation is not complicated. Each agent is maybe 200-300 lines of code. The orchestrator is another 300. The real work is in thinking through the interaction patterns and defining clear boundaries between agents.

The assistant lives at /tools/ai on this blog if you want to see it in action. Ask it to check cluster status and notice the Triage agent. Ask why something is failing and watch the Debugger take over. Ask to restart a deployment and see the Operator's confirmation flow. The agents are visible in the UI, so you can follow along as they work.

Series: Platform AI (Part 3 of 4)