Building Golden Paths with Backstage: Part 1 - Foundation

December 22, 2025 · 10 min read

This is the first in a multi-part series documenting the actual implementation of a golden path for microservices deployment. Not the theory, but the real commands, the actual errors, and what it takes to go from empty cluster to working self-service platform.

There's no shortage of conference talks about platform engineering. Plenty of blog posts explaining why golden paths matter. What's harder to find is someone walking through the actual implementation: the YAML files, the permission errors, the "why isn't this working" moments that don't make it into the polished demos.

That's what this series is. By the end of today's work: a developer fills in 4 fields in Backstage, and 5 minutes later has a running microservice deployed to Kubernetes with health checks responding.

Here's how we built it.


What We're Building

A self-service platform where developers can deploy microservices without:

  • Manually creating Kubernetes namespaces
  • Configuring CI/CD pipelines
  • Setting up container registries
  • Writing ArgoCD applications
  • Requesting infrastructure tickets

The goal: minimize the distance between "I want to deploy a service" and "my service is running in production."


Phase 1: Shared Development Cluster Foundation

The Problem

Every team wants their own Kubernetes cluster. Understandable: clean isolation, no noisy neighbors, full control.

It's also expensive, operationally complex, and usually overkill for development workloads.

The alternative: one shared development cluster with proper namespace isolation. Teams get their own space, resource limits prevent noisy neighbor problems, and operations stays sane. The tradeoff is that you need to actually implement the isolation properly, which is what this section covers.

The Infrastructure Stack

Cluster:

  • Azure Kubernetes Service (AKS)
  • Named aks-shared-dev
  • Single cluster for all development teams

Namespace Management:

  • Crossplane for declarative namespace provisioning
  • Composite Resource Definition (XRD): XNamespaceClaim, exposed to teams as NamespaceClaim
  • Teams request namespaces, Crossplane creates them with isolation

Isolation Mechanisms:

# Resource quotas per team namespace
resourceQuota:
  requestsCpu: "2"
  requestsMemory: "4Gi"
  limitsCpu: "4"
  limitsMemory: "8Gi"
  pods: 50

# Default limits for containers
limitRange:
  cpuRequest: "100m"
  cpuLimit: "500m"
  memoryRequest: "128Mi"
  memoryLimit: "512Mi"

# Network policies enabled
enableNetworkPolicy: true

Each team gets a namespace with hard resource limits. No single team can consume the entire cluster. This sounds obvious, but without these limits, one team's runaway process or misconfigured HPA can starve everyone else.
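
The network policy piece deserves a concrete picture. When enableNetworkPolicy is true, the composition can drop a default-deny policy into the namespace; a minimal sketch of what that might look like (the policy name and exact rules here are assumptions, not the actual composition output):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress   # assumed name
  namespace: red-dev
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}      # only traffic from pods in the same namespace

Traffic between pods inside the namespace keeps working; anything crossing the namespace boundary needs an explicit allow later.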

Why Crossplane Instead of kubectl?

We could just run kubectl create namespace red-dev and be done. It's one command. Why introduce another abstraction layer?

I had this debate with myself for longer than I'd like to admit. The answer comes down to what happens after day one. Creating a namespace is easy. Creating a namespace with the right resource quotas, limit ranges, network policies, RBAC bindings, and labels every single time, consistently, without forgetting anything? That's where things fall apart.

Instead, we use Crossplane to create a NamespaceClaim:

apiVersion: platform.chrishouse.io/v1alpha1
kind: NamespaceClaim
metadata:
  name: red-dev
  namespace: crossplane-system
  labels:
    team: red
    environment: dev
    created-by: backstage
spec:
  parameters:
    targetCluster: shared-dev-cluster
    namespaceName: red-dev
    enableResourceQuota: true
    enableNetworkPolicy: true
    enableLimitRange: true

Why the extra abstraction?

  1. Consistency - Every namespace gets resource quotas, limit ranges, and network policies automatically
  2. GitOps - Namespace configuration is declarative and version-controlled
  3. Self-service - Backstage templates can create namespaces without cluster credentials
  4. Auditability - Clear record of who requested what and when

The Crossplane composition handles the actual Kubernetes API calls. Teams just declare what they need. The composition is essentially a contract: you give me a namespace name and a team label, I give you a fully configured namespace with all the guardrails in place.
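
For the curious, here's a heavily trimmed sketch of what a composition like this can look like. Resource names and provider versions are assumptions, and the real composition also stamps out the ResourceQuota, LimitRange, and NetworkPolicy objects:

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: xnamespaceclaims.platform.chrishouse.io   # assumed name
spec:
  compositeTypeRef:
    apiVersion: platform.chrishouse.io/v1alpha1
    kind: XNamespaceClaim
  resources:
    - name: namespace
      base:
        apiVersion: kubernetes.crossplane.io/v1alpha1   # provider-kubernetes Object
        kind: Object
        spec:
          forProvider:
            manifest:
              apiVersion: v1
              kind: Namespace
              metadata:
                labels:
                  managed-by: crossplane
      patches:
        # Copy the requested name from the claim onto the real namespace
        - fromFieldPath: spec.parameters.namespaceName
          toFieldPath: spec.forProvider.manifest.metadata.name

The claim is the only thing teams ever touch; everything under spec.resources is platform-owned.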

Verification

After setting up the cluster and Crossplane:

$ kubectl get namespaceclaim -n crossplane-system
NAME      SYNCED   READY   AGE
red-dev   True     True    15m

$ kubectl get namespace red-dev -o yaml
apiVersion: v1
kind: Namespace
metadata:
  labels:
    team: red
    environment: dev
    managed-by: crossplane
  name: red-dev

$ kubectl get resourcequota -n red-dev
NAME                AGE
red-dev-quota       15m

$ kubectl get limitrange -n red-dev
NAME                AGE
red-dev-limits      15m
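
For completeness, the quota object that Crossplane stamps out maps directly onto the composition parameters from earlier. Roughly (a sketch, reusing the names from the output above):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: red-dev-quota
  namespace: red-dev
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
    pods: "50"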

Namespace exists. Quotas applied. Limits enforced. This might not look like much, but it's the foundation everything else builds on. Get this wrong, and you'll be debugging resource contention issues and namespace configuration drift for months.

Phase 1 complete.


Phase 2: Microservice Deployment Template

The Goal

Now we have a place to put things. The next question: how do developers actually get their code running there?

The developer experience should be:

  1. Click "Create Component" in Backstage
  2. Fill in: Service Name, Team, Description, Owner
  3. Wait 5 minutes
  4. Service is running

Everything else happens automatically. No terminal. No YAML editing. No waiting for someone in ops to process a ticket.

What "Everything Else" Means

That "everything else" is doing a lot of heavy lifting. Behind that 4-field form, the platform needs to:

  1. Create GitHub repository with proper structure
  2. Configure repository permissions for GHCR (GitHub Container Registry)
  3. Set up CI/CD pipeline to build and publish container images
  4. Create namespace in shared dev cluster (via Crossplane)
  5. Create ArgoCD application manifest
  6. Commit GitOps configuration to platform repo
  7. Let ArgoCD discover and deploy the service

That's seven separate systems that need to coordinate correctly. Any one of them failing silently would leave developers confused about why their service isn't running. This is where the abstraction of "fill in a form" starts looking a lot more complex than it sounds.
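
At a high level, the template chains four scaffolder actions to cover those steps (a trimmed sketch; the full inputs appear in the subsections below, and the intermediate step that renders the GitOps manifests is omitted):

steps:
  - id: fetch-template        # copy the skeleton and substitute parameters
    action: fetch:template
  - id: publish-github        # create the service repository
    action: publish:github
  - id: configure-repo        # flip workflow permissions so GHCR pushes work
    action: armportal:github:configure-repo
  - id: create-gitops-pr      # namespace claim + ArgoCD app into the platform repo
    action: armportal:create-pr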

Let's break it down.

The Backstage Template

Core structure:

apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: nodejs-microservice-quickstart
  title: Node.js Microservice (Quick Start)
spec:
  parameters:
    - title: Service Information
      properties:
        service_name:
          title: Service Name
          type: string
        team:
          title: Team Name
          type: string
        description:
          title: Description
          type: string
        owner:
          title: Owner
          type: string

Four input fields. That's it. Everything else is either defaulted or derived from these values. The temptation to add more fields is strong, but every additional input is friction. Resist it.
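
To make "derived" concrete, here are a few values that never appear on the form but are computed from those four inputs (a sketch of the conventions, not literal template code):

# repo name = service name
repoUrl: github.com?owner=crh225&repo=${{ parameters.service_name }}
# team "red" -> namespace red-dev
namespace: ${{ parameters.team }}-dev
# image path derived from the (lowercased) repository name
image: ghcr.io/crh225/${{ parameters.service_name }}:latest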

Template Steps

Step 1: Scaffold Repository

- id: fetch-template
  name: Fetch Template
  action: fetch:template
  input:
    url: ./skeleton
    values:
      serviceName: ${{ parameters.service_name }}
      team: ${{ parameters.team }}
      description: ${{ parameters.description }}

Backstage copies the template skeleton and replaces variables.

The skeleton includes:

  • Node.js application with Express
  • Dockerfile for container builds
  • GitHub Actions workflow for CI/CD
  • Helm chart for Kubernetes deployment
  • Health check endpoints (/health, /ready)
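
The Helm chart in that skeleton is what wires the probes to those endpoints. Its default values look roughly like this (a sketch; field names are assumptions, only the probe paths and container port come from this post):

# helm/values.yaml (skeleton, before scaffolder substitution)
image:
  repository: ghcr.io/crh225/${{ values.serviceName }}
  tag: latest
service:
  port: 80
  targetPort: 3000          # the Express app listens on 3000
livenessProbe:
  httpGet:
    path: /health
    port: 3000
readinessProbe:
  httpGet:
    path: /ready
    port: 3000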

Step 2: Create GitHub Repository

- id: publish-github
  name: Publish to GitHub
  action: publish:github
  input:
    repoUrl: github.com?owner=crh225&repo=${{ parameters.service_name }}
    defaultBranch: main

Step 3: Configure Repository Permissions (Critical Step)

This is where it gets interesting, and where I burned the most time in the whole build.

GitHub Container Registry requires specific repository permissions. By default, new repositories have workflow permissions set to read. This seems fine until you try to push a container image. The build passes, the login succeeds, and then:

denied: permission_denied: write_package

The error message is unhelpful. You'll go chasing PAT scopes, GHCR authentication problems, Docker login failures. None of that is the problem. The problem is a single checkbox in the repository settings that nobody told you about.

We could fix this manually for each repository, but that defeats the purpose of automation. Better: automate it with a custom Backstage action that configures permissions immediately after repository creation.

Custom action: armportal:github:configure-repo

import { createTemplateAction } from '@backstage/plugin-scaffolder-node';
import { Octokit } from '@octokit/rest';

export const configureGitHubRepoAction = (options?: { token?: string }) => {
  return createTemplateAction({
    id: 'armportal:github:configure-repo',
    async handler(ctx) {
      const { repoUrl, token } = ctx.input as { repoUrl: string; token?: string };

      // Backstage repoUrl format: github.com?owner=<owner>&repo=<repo>
      const match = repoUrl.match(/github\.com\?.*owner=([^&]+).*repo=([^&]+)/);
      if (!match) {
        throw new Error(`Could not parse owner/repo from repoUrl: ${repoUrl}`);
      }
      const [, owner, repo] = match;

      const octokit = new Octokit({ auth: token ?? options?.token });

      // Set default workflow permissions to 'write' so GITHUB_TOKEN can push packages
      await octokit.request('PUT /repos/{owner}/{repo}/actions/permissions/workflow', {
        owner,
        repo,
        default_workflow_permissions: 'write',
        can_approve_pull_request_reviews: true,
      });

      // Make sure GitHub Actions is enabled for the repository
      await octokit.request('PUT /repos/{owner}/{repo}/actions/permissions', {
        owner,
        repo,
        enabled: true,
        allowed_actions: 'all',
      });
    },
  });
};

Used in template:

- id: configure-repo
  name: Configure Repository Permissions
  action: armportal:github:configure-repo
  input:
    repoUrl: github.com?owner=crh225&repo=${{ parameters.service_name }}

Now GHCR publishing works automatically. Every new repository gets the right permissions from the start. No manual intervention, no mysterious failures on the first build.

Step 4: Create GitOps Configuration

- id: create-gitops-pr
  name: Create GitOps PR
  action: armportal:create-pr
  input:
    repoUrl: github.com?owner=crh225&repo=ARMServicePortal
    branch: add-${{ parameters.service_name }}
    title: "Add ${{ parameters.service_name }} to platform"
    files:
      - path: infra/crossplane/claims/aks-shared-dev/${{ parameters.team }}-dev/namespace-claim.yaml
        content: ${{ steps['prepare-gitops'].output.namespaceClaim }}
      - path: infra/quickstart-services/${{ parameters.service_name }}/argocd-application.yaml
        content: ${{ steps['prepare-gitops'].output.argoApplication }}

This creates a pull request in the platform repository with:

  • Namespace claim for Crossplane
  • ArgoCD application manifest

Merge the PR. GitOps takes over. This separation matters: the template creates the intent (a PR), but a human approves the actual deployment. You can make this automatic if you want, but having that gate gives teams a moment to review what's about to happen.
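
The generated per-service Application is deliberately small. Roughly what lands in infra/quickstart-services/<service>/argocd-application.yaml, written here with the template placeholders for clarity (a sketch; anything beyond what's stated in this post is an assumption):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ${{ parameters.service_name }}
  namespace: argocd
spec:
  project: applications
  source:
    repoURL: https://github.com/crh225/${{ parameters.service_name }}.git
    targetRevision: main
    path: helm                              # the skeleton's Helm chart
  destination:
    server: https://kubernetes.default.svc
    namespace: ${{ parameters.team }}-dev   # the namespace from Phase 1
  syncPolicy:
    automated:
      prune: true
      selfHeal: true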

The CI/CD Pipeline

The template generates a complete GitHub Actions workflow for each new service. Nothing fancy here, just the standard build-and-push pattern:

name: Build and Push to GHCR

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - uses: actions/checkout@v3

      - name: Convert repository name to lowercase
        id: repo
        run: echo "repository=${{ github.repository }}" | tr '[:upper:]' '[:lower:]' >> $GITHUB_OUTPUT

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v2
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: ghcr.io/${{ steps.repo.outputs.repository }}:latest

Key detail: lowercase repository name. GHCR requires it, and GitHub repository names can have uppercase letters. This mismatch will cause failures if you forget to normalize. The workflow handles it with that tr command, but I learned this the hard way after debugging "image not found" errors for an embarrassingly long time.

ArgoCD Service Discovery

Problem: Every time we create a new service, we'd need to manually register it with ArgoCD. That's exactly the kind of manual step that makes developers go around the platform.

Solution: App-of-Apps pattern. ArgoCD watches a directory. When new files appear, it creates applications automatically.

Created infra/argocd/apps/quickstart-services-app.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: quickstart-services
  namespace: argocd
spec:
  project: applications
  source:
    repoURL: https://github.com/crh225/ARMServicePortal.git
    targetRevision: main
    path: infra/quickstart-services
    directory:
      recurse: true
      include: '*/argocd-application.yaml'
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

ArgoCD watches infra/quickstart-services/. When it finds a new argocd-application.yaml, it creates the application automatically. The template commits the application manifest, and within a few minutes ArgoCD notices the new file and deploys the service.

No manual ArgoCD registration required. No "please add my app to ArgoCD" tickets.

The First Error: ServiceMonitor CRD Missing

Everything was working. Repository created, permissions configured, image built and pushed. ArgoCD picked up the application manifest and started syncing.

Then it failed. ArgoCD showed:

OutOfSync
Missing

Error: failed to discover server resources for group version monitoring.coreos.com/v1:
the server could not find the requested resource

The template included a ServiceMonitor resource for Prometheus metrics, because of course you want metrics for your services. But Prometheus Operator wasn't installed on the cluster yet. The CRD didn't exist. ArgoCD couldn't apply something that the cluster didn't understand.

This is a common trap with golden path templates: including "best practices" resources that depend on infrastructure that doesn't exist yet. The fix is simple, but the lesson is important: don't put aspirational resources in your default template.

Fix: Disable ServiceMonitor by default in template.

# helm/values.yaml
serviceMonitor:
  enabled: false  # Set to true after installing Prometheus Operator

Re-sync. Success. Enable it later when Prometheus is actually running.
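
In the chart, the flag gates the entire manifest, so clusters without the Prometheus Operator CRDs never see a ServiceMonitor at all. A sketch of the guarded template (the metadata and port names are assumptions):

# helm/templates/servicemonitor.yaml
{{- if .Values.serviceMonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ .Release.Name }}
spec:
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  endpoints:
    - port: http
{{- end }}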

Verification: pricing-api Deployment

Time to test the whole pipeline end-to-end. I created pricing-api via the Backstage template, filled in the four fields, and waited:

$ kubectl get pods -n red-dev
NAME                          READY   STATUS    RESTARTS   AGE
pricing-api-6f4b8d9c7-8xk2p   1/1     Running   0          2m
pricing-api-6f4b8d9c7-m7n4q   1/1     Running   0          2m

$ kubectl port-forward -n red-dev svc/pricing-api 8080:80
Forwarding from 127.0.0.1:8080 -> 3000

$ curl http://localhost:8080/health
{"status":"healthy","timestamp":"2025-12-22T15:30:00.000Z"}

Two pods running. Health checks responding. The entire chain worked: Backstage created the repo, configured GHCR permissions, GitHub Actions built and pushed the image, the GitOps PR got merged, ArgoCD deployed the application.

Time from Backstage form submission to running service: 4 minutes, 37 seconds.

That number matters. If this took 30 minutes, developers would start looking for shortcuts. If it took 2 days waiting for approvals, they'd definitely route around the platform. Under 5 minutes is fast enough that the golden path becomes the path of least resistance.

Phase 2 complete.



What Actually Happened

Starting from an empty shared development cluster, we built:

Infrastructure:

  • Crossplane namespace provisioning with resource quotas
  • Shared AKS cluster for all development teams
  • Namespace isolation with limits and network policies

Automation:

  • Backstage template for Node.js microservices
  • Custom action for automatic GHCR permissions
  • CI/CD pipeline for container builds
  • ArgoCD App-of-Apps for service discovery
  • GitOps workflow for all deployments

Developer Experience:

  • 4 input fields in Backstage
  • ~5 minutes to running service
  • No manual configuration required
  • No cluster credentials needed
  • No infrastructure tickets

What worked:

  • End-to-end automation from form to deployment
  • Automatic GHCR configuration
  • Service discovery via App-of-Apps pattern
  • Resource isolation in shared cluster

What didn't (initially):

  • ServiceMonitor CRD dependency
  • Repository name case sensitivity in GHCR
  • ArgoCD not watching quickstart-services directory

All fixed. System working. Each of these issues took time to debug, but once fixed, they stay fixed. Every developer who uses the template after this benefits from the lessons learned.


What's Next

This gets us from zero to deployed service. A developer can fill out a form and have a running microservice in under 5 minutes. But we're missing some important pieces:

Networking:

  • External ingress (currently ClusterIP only)
  • Istio service mesh for mTLS
  • DNS configuration

Developer Workflow:

  • Preview environments per pull request
  • Automated testing in pipelines
  • Rollback mechanisms

Security:

  • Policy enforcement (OPA/Kyverno)
  • Secret management (External Secrets Operator)
  • Image scanning

Those are the next phases. Each one adds complexity, and each one needs to be optional until it's required.

The principle remains: make the golden path faster than the alternative.

If creating a service this way takes 5 minutes, and the manual alternative takes days of tickets and approvals, developers will use the platform.

If the platform becomes slower or more restrictive than doing it manually, they'll route around it.

That's the balance we're building toward.


Implementation Repository

Full implementation available at: github.com/crh225/ARMServicePortal



Next in series: Part 2 will cover preview environments: automatically creating ephemeral deployments for every pull request, with DNS, TLS certificates, and automatic cleanup.

