Platform AI Part 4: Building a Claude Code Plugin to Develop the AI Itself

February 01, 2026 · 8 min read

This is Part 4 of the Platform AI series. Part 1 covers building the basic chatbot. Part 2 adds agentic tool use. Part 3 introduces the multi-agent system.

The AI assistant from the first three parts of this series has grown into a real system. Six specialized agents, a hallucination guard with four verification layers, 81 end-to-end tests, tool retry logic, parallel execution, budget tracking, and a self-evaluation pipeline. Making changes to it now requires understanding how all those pieces fit together.

I found myself repeating the same workflow every time I needed to fix a bug or add a feature. Export a chat session from the assistant to see what went wrong. Read through the codebase to find the relevant files. Make changes. Run the e2e suite. Check for regressions. That process is exactly the kind of structured, multi-step workflow that AI excels at.

So I built a Claude Code plugin to do it for me.

Live: https://chrishouse.io/tools/ai


What Is a Claude Code Skill?

Claude Code is Anthropic's CLI tool for using Claude directly in your terminal. It reads your codebase, edits files, runs commands, and understands project context. But out of the box, it's general-purpose. It doesn't know that my project has a specific agent architecture, that tests live in e2e/, or that every async function needs try/catch with bracket-prefixed logging.

Skills (also called slash commands) let you inject domain-specific expertise into Claude Code. A skill is a markdown file that acts as a system prompt, activated by typing /skill-name in the CLI. When you invoke a skill, Claude Code loads that markdown as additional context, effectively turning a general-purpose AI into a specialist for your project.

The skill lives at .claude/commands/ai-dev-studio.md in the repo. There's also a plugin registration that tells Claude Code this skill exists:

{
  "name": "ai-dev-studio",
  "version": "1.0.0",
  "description": "AI Development Studio - Multi-agent system for building and improving the AI DevOps assistant",
  "author": {
    "name": "Chris House",
    "email": "[email protected]"
  }
}

The AI Dev Studio

The skill is called /ai-dev-studio and it transforms Claude Code into a six-agent development team. Each agent has a specific role in the development lifecycle:

User Request
     │
     ▼
┌─────────────┐
│  Classify   │ ── New Feature? Bug Fix? Improvement? Review?
└─────────────┘
     │
     ▼
┌──────────────────────────────────────────────────┐
│                Agent Pipeline                    │
│                                                  │
│  ARCHITECT → DEVELOPER → QA → REVIEWER → PRODUCT │
│                                                  │
│  Bug fix:    DEBUGGER → DEVELOPER → QA           │
│  Improvement: REVIEWER → DEVELOPER → QA          │
└──────────────────────────────────────────────────┘

When I type /ai-dev-studio and paste a chat session export or describe a bug, the skill classifies the request and activates the right agents in sequence.


The Six Agents

Architect

Architect handles design work. When I ask it to add a new feature, it reads the existing codebase first, studies how similar features are built, then produces a blueprint with specific file paths, line numbers, data flows, and integration points. It knows about the BaseAgent pattern, the agent orchestrator, the tool definition conventions, and the module structure.

Developer

Developer writes the actual code. It follows the existing conventions exactly because the skill prompt includes specific code style examples from the codebase. Named functions instead of arrow functions. Try/catch with bracket-prefixed console logs like [ModuleName]. Zod schemas for tool inputs. The skill enforces these patterns so the output matches the rest of the codebase without manual cleanup.
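
As a rough illustration of those conventions (the module and function names here are hypothetical, not taken from the repo):

// Named function, try/catch, and bracket-prefixed logging, per the skill's
// code style rules. The module name is hypothetical.
async function formatClusterSummary(clusters) {
  try {
    console.log(`[ClusterSummary] Formatting ${clusters.length} clusters`);
    return clusters.map((c) => `${c.name} (${c.region}): ${c.status}`).join('\n');
  } catch (error) {
    console.error('[ClusterSummary] Failed to format summary:', error);
    throw error;
  }
}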

QA

QA writes Playwright e2e tests and runs them. The skill includes the exact test fixture pattern from e2e/fixtures.js so QA knows to call clearBrowserState() first, use aiAssistant.sendMessage() with appropriate timeouts, and check for both positive results and error states. After writing tests, it runs them and reports results.
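
A hedged sketch of the test shape QA produces: the fixture names come from e2e/fixtures.js as described above, but the exact signatures, selectors, and assertions here are assumptions.

// Sketch only: assumes fixtures.js re-exports Playwright's test/expect plus
// the clearBrowserState helper and an aiAssistant fixture.
const { test, expect, clearBrowserState } = require('./fixtures');

test('reports cluster status without fabricating data', async ({ page, aiAssistant }) => {
  await clearBrowserState(page);   // start from a clean session
  const reply = await aiAssistant.sendMessage('check my clusters status', { timeout: 60_000 });

  // Check the positive result and make sure no fabrication warning appears.
  expect(reply).toBeTruthy();
  await expect(page.getByText(/fabrication warning/i)).toHaveCount(0);
});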

Debugger

Debugger is the investigator. When a test fails or a session export shows unexpected behavior, Debugger reads the error, traces the execution path through the code, forms hypotheses ranked by confidence, and proposes minimal fixes. It uses a structured format with root cause analysis, evidence, and validation steps.

Reviewer

Reviewer checks for code quality, security, DRY violations, and convention adherence. It produces prioritized findings: P0 (must fix), P1 (should fix), P2 (nice to have). The skill includes specific checklists for pattern adherence, error handling, security (input validation, injection risks), and code quality.

Product

Product is the final quality gate. It reviews user-facing elements like error messages, success states, and response formatting against what the skill calls "Apple-level quality standards." The idea is that every user-visible string should be clear, actionable, and helpful rather than generic.


The Session Export Workflow

The most powerful pattern is feeding the plugin exported chat sessions from the AI assistant itself. The assistant's frontend has an export button that dumps the full session as JSON, including every message, tool call, tool result, agent routing decision, thinking blocks, and warnings.
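
The exact export schema isn't shown here, but conceptually each turn carries the message plus the metadata the Debugger needs. The field names below are illustrative, not the assistant's actual schema:

// Illustrative shape of one exported assistant turn (field names are assumptions).
const exportedTurn = {
  role: 'assistant',
  agent: 'triage',              // agent routing decision
  confidence: 0.70,             // routing confidence
  toolCalls: [
    { name: 'get_cluster_status', status: 'error', error: 'timeout' },
  ],
  thinking: '...',              // thinking blocks, if captured
  warnings: [],                 // e.g. fabrication warnings surfaced to the user
  content: '| Cluster | Region | Version | Nodes |\n| ... |',
};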

When I paste that export into /ai-dev-studio, the Debugger agent can trace exactly what happened:

User asked: "check my clusters status"
  → Triage agent selected (confidence: 0.70)
  → get_cluster_status called → ERROR (4 times)
  → Model generated response with cluster data table
  → No fabrication warning shown

Problem: All tools failed but the model fabricated data

This is how I found and fixed the fabrication detector issues. The session showed that get_cluster_status failed four times with timeouts, but the assistant still presented a complete table with cluster names, regions, versions, and node counts. All fabricated. The Debugger agent traced the issue through the hallucination guard's four verification layers, identified why the detection missed it (markdown tables aren't code blocks, so the config-pattern matching didn't trigger), and proposed the exact fix.

In a later session, the IaC agent gave legitimate YAML suggestions for improving ArgoCD configurations, but the fabrication detector flagged them as hallucinations because no tools were called. The Debugger traced that false positive to the processHallucinationGuard function where the logic assumed "no tools + YAML = fabrication" without considering that advisory agents work from pre-loaded context.
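
A simplified sketch of the corrected heuristic (not the actual guard code; the function and constant names are mine):

// Advisory agents can legitimately emit YAML/config without calling tools,
// so they are exempted before the "no tools + config-looking output" check fires.
const ADVISORY_AGENTS = ['iac', 'advisor'];

function looksFabricated({ agentType, toolCallCount, responseText }) {
  const isAdvisoryAgent = ADVISORY_AGENTS.includes(agentType);
  const looksLikeConfig = /apiVersion:|kind:|syncOptions:|retry:|finalizers:/.test(responseText);
  return !isAdvisoryAgent && toolCallCount === 0 && looksLikeConfig;
}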

Both fixes, across three files, came from the same workflow: export session, paste into /ai-dev-studio, let the agents trace the problem.


How It Knows the Codebase

The skill prompt is 600 lines of markdown that encodes the project's architecture, conventions, and patterns. It includes the agent capability system:

const AgentCapability = {
  READ_ONLY: 'read_only',     // Triage, Advisor
  SUGGEST: 'suggest',          // Suggestions requiring confirmation
  EXECUTE: 'execute',          // Operator - confirmation required
  AUTONOMOUS: 'autonomous'     // Rare
};

It also includes the BaseAgent class pattern (how to define tools, canHandle, system prompts, and handoff logic), the DynamicStructuredTool pattern for defining new tools with Zod schemas, the e2e test fixture pattern, and the meta-tool pattern for composite operations like diagnose_pod_crash.
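
As an example of the DynamicStructuredTool pattern, here's a hedged sketch assuming LangChain's export from @langchain/core/tools; the handler body is a stub, not the real tool:

// A tool definition in the style the skill documents: Zod schema for inputs,
// bracket-prefixed logging, stringified result.
const { DynamicStructuredTool } = require('@langchain/core/tools');
const { z } = require('zod');

const getClusterStatusTool = new DynamicStructuredTool({
  name: 'get_cluster_status',
  description: 'Fetch status for a named Kubernetes cluster',
  schema: z.object({
    clusterName: z.string().describe('Cluster to query'),
  }),
  func: async ({ clusterName }) => {
    console.log(`[ClusterTools] get_cluster_status: ${clusterName}`);
    // A real implementation would call the cluster API here.
    return JSON.stringify({ clusterName, status: 'unknown' });
  },
});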

This context means Claude Code doesn't have to rediscover these patterns every time. When the Developer agent writes a new agent, it follows the exact pattern from the skill. When QA writes tests, it uses the exact fixture API.


Request Classification

The skill routes requests to different agent pipelines based on keywords:

Request Type    Keywords                            Pipeline
New Feature     "add", "create", "build"            Architect, Developer, QA, Reviewer, Product
Bug Fix         "fix", "broken", "error"            Debugger, Developer, QA
Improvement     "improve", "optimize", "refactor"   Reviewer, Developer, QA
Test Coverage   "test", "e2e", "playwright"         QA
Architecture    "design", "architect", "plan"       Architect
Code Review     "review", "check", "quality"        Reviewer, Product
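
The routing itself lives in the skill's markdown rather than in code, but the logic amounts to a keyword match. A minimal sketch, abbreviated to three of the six request types (constant and function names are mine):

// Keyword-based request classification mirroring the table above; the first
// matching pipeline wins, and unmatched requests fall back to the full pipeline.
const PIPELINES = [
  { type: 'bug_fix',     keywords: ['fix', 'broken', 'error'],          agents: ['Debugger', 'Developer', 'QA'] },
  { type: 'improvement', keywords: ['improve', 'optimize', 'refactor'], agents: ['Reviewer', 'Developer', 'QA'] },
  { type: 'new_feature', keywords: ['add', 'create', 'build'],          agents: ['Architect', 'Developer', 'QA', 'Reviewer', 'Product'] },
];

function classifyRequest(text) {
  const lower = text.toLowerCase();
  const match = PIPELINES.find((p) => p.keywords.some((k) => lower.includes(k)));
  return match ?? PIPELINES.find((p) => p.type === 'new_feature');
}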

This mirrors how the AI assistant itself routes requests to its own agents. There's something recursive about an AI development tool that uses the same multi-agent routing pattern as the AI it's building.


Context Preservation Between Agents

When agents hand off to each other, the skill specifies a structured handoff format:

### Handoff: Debugger → Developer

**Work Completed**:
- Traced fabrication detection through 4 layers of hallucination guard
- Identified false positive trigger at response.js:135

**Key Findings**:
- Advisory agents (iac, advisor) legitimately produce YAML without tools
- The "no tools + config = fabrication" heuristic is too aggressive

**Artifacts**:
- Root cause analysis with file:line references

**Next Steps**:
- Add agent type exemption for advisory agents
- Broaden suggestion detection patterns in guard.js

This means the Developer agent gets exactly what it needs to implement the fix without re-reading the entire codebase. The Debugger already did the investigation and narrowed down the specific lines that need to change.


Quality Gates

Each agent phase has a quality gate before handing off:

  • Architect: Blueprint must be complete with specific file paths and line numbers
  • Developer: Code must parse without errors
  • QA: Tests must be valid and pass
  • Debugger: Root cause must have supporting evidence
  • Reviewer: All P0 issues must be addressed
  • Product: User-facing strings must be clear and actionable

If a gate fails, the pipeline loops back. QA failure goes to Debugger for diagnosis, then back to Developer for the fix, then back to QA.
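
In the skill this loop is written as plain-English instructions, but expressed as code the routing looks roughly like this (names are mine, for illustration only):

// Hand off forward when a gate passes; route QA failures through Debugger,
// whose diagnosis goes back to Developer and then returns to QA.
const HANDOFF_ORDER = { Architect: 'Developer', Developer: 'QA', QA: 'Reviewer', Reviewer: 'Product', Product: null };

function nextPhase(phase, gatePassed) {
  if (!gatePassed) {
    if (phase === 'QA') return 'Debugger';        // failed tests go to Debugger for diagnosis
    return phase;                                  // other gates rework the same phase
  }
  if (phase === 'Debugger') return 'Developer';    // diagnosis hands back to Developer, then QA again
  return HANDOFF_ORDER[phase];                     // otherwise continue down the pipeline
}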


Real Example: Fixing the Fabrication Detector

Here's how a real bug fix flowed through the plugin. I exported a session where the IaC agent's YAML suggestions triggered a false fabrication warning, and pasted it into /ai-dev-studio.

Debugger analyzed the session export and traced the issue through three code paths:

  1. response.js:135 - The "no tools + suspicious content" check didn't account for advisory agents
  2. guard.js:384-389 - The suggestion detection only matched narrow K8s manifest patterns (apiVersion, kind), missing ArgoCD patterns like retry:, syncOptions:, finalizers:
  3. The all-tools-failed case where the model fabricated data presented in markdown tables rather than code blocks

Developer implemented three targeted fixes:

  1. Added isAdvisoryAgent check in response.js to exempt IaC and Advisor agents
  2. Added iacSuggestionPatterns array in guard.js with 14 ArgoCD/Helm-specific patterns
  3. Added toolCallsAttempted counter in ai-chat.js to detect when all tool calls failed but the response contains structured data

QA ran the full 81-test e2e suite. Result: 81 passed, 3 flaky (API rate limits), 1 failed (pre-existing wrangler network issue). No regressions from the changes.

Three files changed, 51 lines added, 7 removed. The entire flow from session export to validated fix happened in a single Claude Code conversation.


Agents Building Agents

The most interesting aspect of this setup is that it's agents all the way down. The AI assistant running at /tools/ai uses six specialized agents (Triage, Operator, Debugger, IaC, Advisor, Network) to handle user requests. The development tool that builds and maintains that assistant uses six different specialized agents (Architect, Developer, QA, Debugger, Reviewer, Product) coordinated through a Claude Code skill.

Both systems use the same core patterns: intent classification to route to the right specialist, structured handoffs to preserve context, capability levels to control what each agent can do, and quality gates to catch issues before they ship.

The difference is scope. The runtime agents work with Kubernetes clusters and infrastructure. The development agents work with the codebase that defines those runtime agents. But the architecture is the same.


Setting It Up

The plugin requires two files in your repo:

Plugin registration at .claude/plugins/ai-dev-studio/.claude-plugin/plugin.json:

{
  "name": "ai-dev-studio",
  "version": "1.0.0",
  "description": "AI Development Studio for the AI DevOps assistant"
}

Skill definition at .claude/commands/ai-dev-studio.md:

---
name: ai-dev-studio
description: Multi-agent AI Development Studio...
---

# AI Development Studio

You are the AI Development Studio...
[600 lines of agent specs, patterns, and workflows]

Then in Claude Code, type /ai-dev-studio followed by your request. The skill activates and Claude Code becomes the development studio.


Conclusion

Building a development tool for an AI system using the same multi-agent patterns as the AI system itself felt like the natural evolution of this project. The AI assistant at /tools/ai manages Kubernetes infrastructure through specialized agents. The Claude Code plugin at /ai-dev-studio manages the assistant's codebase through specialized agents. Same architecture, different domain.

The practical value is real. Exporting a broken session from the assistant, pasting it into the plugin, and getting a traced root cause with a validated fix across multiple files in a single conversation is a workflow I now use regularly. The plugin knows the codebase patterns deeply enough that the Developer agent's output matches the existing code style without manual cleanup, and the QA agent runs the actual e2e suite to catch regressions before they ship.

The 600-line skill definition is just markdown describing how the codebase works and what quality standards to follow. The leverage you get from having an AI that understands your specific architecture, conventions, and testing patterns compounds with every bug fix and feature you build through it.
