Building Golden Paths with Backstage: Part 2 - Preview Environments

December 23, 2025 · 15 min read

In Part 1, we built the foundation: a shared development cluster, namespace isolation, and a Backstage template that deploys a running service in under 5 minutes.

Today we add something developers actually love: preview environments.

If you've ever waited for a staging environment to free up so you could test your changes, or merged a PR only to find it broke something obvious that would have been caught with a quick manual test, you understand the problem. Shared staging environments create bottlenecks. Developers queue up behind each other, or worse, deploy over each other's changes and then spend time debugging phantom issues.

Preview environments solve this by giving every pull request its own isolated deployment with a unique URL. Code reviewers can click a link in the PR and see the actual running application. Not screenshots, not local recordings, the real thing. QA can test changes before they hit the main branch. Product managers can review features without asking developers to deploy something special for them.

The concept is simple: PR opens, environment spins up, PR closes, environment disappears. The implementation? That's where it gets interesting.

Here's what we built, and what broke along the way.


What We're Building

When a developer opens a pull request:

  1. GitHub Actions builds the PR branch into a container image
  2. A GitOps workflow creates an ephemeral namespace
  3. ArgoCD deploys the preview environment
  4. A unique URL is generated: pricing-api-pr-1-red.chrishouse.io
  5. The PR gets a comment with the preview link
  6. When the PR closes, everything is cleaned up automatically

The hard part isn't the workflow. It's the routing.


The Architecture Challenge

Here's where our design decisions from Part 1 created an interesting puzzle. We have a hub-spoke cluster topology:

  • Hub Cluster: Handles all external ingress, runs ArgoCD, Crossplane, Backstage
  • Dev Spoke Cluster: Runs application workloads, has no external ingress

This separation is intentional. The hub cluster is the control plane: it manages infrastructure, handles GitOps, and serves as the single entry point from the internet. The spoke cluster runs actual workloads, isolated from the management plane. It's a common pattern for enterprise Kubernetes deployments where you want to keep your cattle separate from your pets.

But preview environments run on the dev spoke. And DNS points to the hub.

Traffic flow needs to be:

Internet → Hub Istio Gateway → ??? → Dev Spoke → Preview Pod

That middle part is the question mark we need to solve. How do you route traffic from one cluster to another when they're on different networks? The clusters can talk to each other through VNet peering, but Kubernetes services don't automatically span clusters. The hub's Istio gateway has no native way to forward traffic to a service running in a completely different cluster.

This is a solved problem in the Kubernetes ecosystem. There are several approaches, but each comes with tradeoffs. Let's walk through what we tried.


Phase 1: The Preview Workflow

Before tackling the routing problem, let's set up the workflow that will trigger everything. This part is relatively straightforward: a GitHub Actions workflow that fires on pull request events.

The workflow lives in the Backstage template skeleton, which means every service created through the golden path automatically gets preview environment support. Developers don't have to configure anything; it just works.

# .github/workflows/preview.yml
name: Preview Environment

on:
  pull_request:
    types: [opened, synchronize, reopened, closed]

env:
  REGISTRY: ghcr.io
  SERVICE_NAME: pricing-api
  TEAM_NAME: red
  GITOPS_REPO: crh225/ARMServicePortal

jobs:
  deploy-preview:
    if: github.event.action != 'closed'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      pull-requests: write

    steps:
      - uses: actions/checkout@v3

      - name: Set environment variables
        id: vars
        run: |
          echo "pr_number=${{ github.event.pull_request.number }}" >> $GITHUB_OUTPUT
          echo "namespace=${TEAM_NAME}-dev-pr-${{ github.event.pull_request.number }}" >> $GITHUB_OUTPUT
          echo "hostname=${SERVICE_NAME}-pr-${{ github.event.pull_request.number }}-${TEAM_NAME}.chrishouse.io" >> $GITHUB_OUTPUT
          echo "image_tag=pr-${{ github.event.pull_request.number }}" >> $GITHUB_OUTPUT

      - name: Build and push preview image
        uses: docker/build-push-action@v4
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:pr-${{ github.event.pull_request.number }}

      - name: Create GitOps manifests
        run: |
          # Creates NamespaceClaim and ArgoCD Application
          # Commits to ARMServicePortal repo
          # ArgoCD discovers and deploys

The workflow creates two files in the platform repository:

NamespaceClaim - Creates the isolated namespace:

apiVersion: platform.chrishouse.io/v1alpha1
kind: NamespaceClaim
metadata:
  name: red-dev-pr-1
  labels:
    ephemeral: "true"
    pr-number: "1"
spec:
  parameters:
    targetCluster: shared-dev-cluster
    namespaceName: red-dev-pr-1
    enableResourceQuota: true

ArgoCD Application - Deploys the preview:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: pricing-api-pr-1
  namespace: argocd
  labels:
    ephemeral: "true"
spec:
  source:
    repoURL: https://github.com/crh225/pricing-api
    targetRevision: feature-branch
    path: helm
    helm:
      parameters:
        - name: image.tag
          value: pr-1
        - name: istio.preview.enabled
          value: "true"
        - name: istio.preview.hostname
          value: pricing-api-pr-1-red.chrishouse.io
  destination:
    name: shared-dev-cluster
    namespace: red-dev-pr-1

The workflow uses a personal access token (GITOPS_TOKEN) to commit to the platform repository. This feels like a hack but is actually the standard GitOps pattern: your application repo triggers changes in your infrastructure repo, and your GitOps tool (ArgoCD in our case) picks up those changes and applies them.
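That elided "Create GitOps manifests" step is mostly plain shell: clone the platform repo, drop the two rendered manifests into a preview folder, commit, and push. A minimal sketch (git identity and templating details simplified; the folder path is the same one the cleanup job removes later):

- name: Create GitOps manifests
  env:
    GITOPS_TOKEN: ${{ secrets.GITOPS_TOKEN }}
    PR_NUM: ${{ steps.vars.outputs.pr_number }}
  run: |
    git clone "https://x-access-token:${GITOPS_TOKEN}@github.com/${GITOPS_REPO}.git" gitops
    cd gitops
    mkdir -p "infra/quickstart-services/${SERVICE_NAME}/preview-pr-${PR_NUM}"
    # Render the NamespaceClaim and ArgoCD Application (shown above) into that folder
    # (envsubst over checked-in templates, yq, or plain heredocs all work here)
    git config user.name "github-actions[bot]"
    git config user.email "github-actions[bot]@users.noreply.github.com"
    git add .
    git commit -m "Add preview environment for PR #${PR_NUM}"
    git push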

This part worked immediately. Within a couple minutes of opening a PR, pods were running and the service was created. The Crossplane NamespaceClaim created the isolated namespace, ArgoCD deployed the application, and we had a working preview environment.

Except nobody could reach it. Now comes routing.


Phase 2: The Routing Problem

This is the part that took the longest to solve. Not because the concepts are hard, but because cloud networking has a way of surprising you with limitations that seem arbitrary until you understand the underlying infrastructure.

Attempt 1: Internal LoadBalancer

The obvious first attempt: put an Istio east-west gateway on the dev spoke with an internal (private) LoadBalancer. The hub cluster is connected to the dev spoke over Azure VNet peering, so it should be able to reach an internal IP in the dev spoke's VNet.

service:
  type: LoadBalancer
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"

Hub cluster gets the internal IP. ServiceEntry points to it. VirtualService routes preview traffic.

Result:

$ curl http://10.1.0.158/health
# ... hangs ...
# 100% packet loss

Root cause: Azure internal LoadBalancers are not accessible across VNet peering by default. The hub cluster (10.0.0.0/16) cannot reach the dev spoke's internal LoadBalancer (10.1.0.158 in 10.1.0.0/16).

This was frustrating because everything looked correct. The VNet peering showed "Connected" in the Azure portal. NSG rules allowed VirtualNetwork traffic in both directions. Route tables looked fine. I spent a good hour checking every setting.

The issue is that Azure internal LoadBalancers use a different networking path than regular VNet traffic. They're implemented with a load balancer frontend IP that lives in a special Azure networking plane, and that plane doesn't traverse VNet peering without additional configuration. The VM-to-VM traffic works fine over peering; the traffic to the LoadBalancer frontend IP doesn't.

Azure networking strikes again.

Attempt 2: Azure Private Link

The Azure-recommended solution for this exact problem is Private Link. You create a Private Link Service that fronts the internal LoadBalancer, then create a Private Endpoint in the hub VNet that connects to that service. Traffic flows through Azure's backbone, never touching the public internet, and you get a private IP in the hub VNet that routes to the dev spoke's LoadBalancer.

Pros: Proper Azure-native cross-VNet LoadBalancer access. Fully private. Works reliably.

Cons: Adds complexity (two more Azure resources to manage), cost (~$7/month for the private endpoint), and another moving part that can break. For production environments handling sensitive traffic, this is the right answer. For dev preview environments where I'm trying to minimize costs, it felt like overkill.
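For what it's worth, if you do go this route, AKS can create the Private Link Service for you straight from the east-west gateway's Service via the cloud-provider-azure annotations (a sketch; the Private Endpoint on the hub side is still a separate resource you have to create):

service:
  type: LoadBalancer
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    # Ask the Azure cloud provider to front this internal LB with a Private Link Service
    service.beta.kubernetes.io/azure-pls-create: "true"
    service.beta.kubernetes.io/azure-pls-name: eastwest-pls   # name is illustrative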

Attempt 3: NodePort with Hardcoded IPs

I considered bypassing the LoadBalancer entirely. Kubernetes NodePort services expose a port on every node's IP address. Since VNet peering does work for node-to-node traffic, I could add all the dev spoke's node IPs to the ServiceEntry.

endpoints:
  - address: "10.1.0.4"  # node1
  - address: "10.1.0.5"  # node2
  - address: "10.1.0.6"  # node3

Pros: Works with VNet peering. No additional Azure services needed.

Cons: Hardcoded IPs. The moment the cluster scales up, scales down, or nodes get replaced (which happens regularly with AKS upgrades), this breaks. I'd have to build automation to keep the ServiceEntry in sync with the node pool, and that felt like fighting Kubernetes rather than working with it. Not acceptable for anything beyond a quick test.

Attempt 4: Istio Multi-Cluster with Remote Secrets

Istio was designed for exactly this scenario. Istio's multi-cluster support allows a single service mesh to span multiple Kubernetes clusters, with traffic routed seamlessly between them. It's how large organizations run Istio across regions, clouds, and network boundaries.

This is what worked.


Phase 3: Istio Multi-Cluster

The Concept

Istio multi-cluster is one of those features that sounds complex but solves a real problem elegantly. At its core, it allows Istio's control plane (istiod) to discover and route to services running in other clusters, as if they were local services.

The key insight is that cross-cluster communication doesn't have to be complicated if your service mesh understands the topology. Instead of manually configuring routing rules and endpoints, you tell Istio "here's another cluster you should know about" and it handles the rest.

The key components:

  1. Remote Secrets: Kubeconfig credentials that allow istiod in one cluster to query the Kubernetes API of another cluster. Once istiod can list services and endpoints in the remote cluster, it can route traffic there.
  2. East-West Gateway: A dedicated ingress point for cross-cluster traffic. Unlike the north-south gateway that handles external traffic, the east-west gateway handles internal mesh traffic between clusters.
  3. Network Topology Labels: Labels on the istio-system namespace that tell Istio which network each cluster belongs to. This helps Istio understand when traffic needs to cross a network boundary.

Implementation

Step 1: Label namespaces with network topology

# Hub cluster
kubectl label namespace aks-istio-system \
  topology.istio.io/network=hub-network

# Dev spoke
kubectl label namespace istio-system \
  topology.istio.io/network=shared-dev-network

Step 2: Create remote secrets

Remote secrets allow istiod in one cluster to discover services running in another cluster. The Istio documentation covers this well. Use istioctl create-remote-secret to generate the secret for each cluster, then apply it to the other cluster's Istio namespace.

See the Istio multi-cluster installation guide for the complete setup process.
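A minimal sketch of that exchange, assuming kubeconfig contexts named after the clusters (yours will differ, and with the AKS Istio addon the hub-side secret needs to land in the namespace where istiod runs, aks-istio-system):

# Let the hub's istiod discover services and endpoints on the dev spoke
istioctl create-remote-secret \
  --context=aks-shared-dev \
  --name=shared-dev-cluster | \
  kubectl apply -f - --context=aks-mgmt-hub

# And the reverse, so the dev spoke's istiod can discover hub services
istioctl create-remote-secret \
  --context=aks-mgmt-hub \
  --name=aks-mgmt-hub | \
  kubectl apply -f - --context=aks-shared-dev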

Step 3: Change east-west gateway to public LoadBalancer

Here's where we make the pragmatic choice. The internal LoadBalancer didn't work due to Azure's networking model, and Private Link adds cost and complexity. So we use a public LoadBalancer instead.

# istio-eastwest-gateway-argocd-app.yaml
service:
  type: LoadBalancer
  annotations:
    # Removed: azure-load-balancer-internal annotation
    service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz/ready

Result: East-west gateway gets public IP 52.255.217.180.

This means the east-west gateway is technically internet-accessible. I'll discuss the security implications later, but the short version is: for a dev environment with no sensitive data, it's an acceptable tradeoff.

Why public instead of private?

This is a cost/complexity tradeoff. The "proper" Azure solution would be:

  1. Create an Azure Private Link Service exposing the east-west gateway
  2. Create a Private Endpoint in the hub VNet
  3. Route through the private endpoint

That adds ~$7/month for the private endpoint, plus complexity. For a personal lab environment running on my credit card, the public LoadBalancer at ~$3.65/month is acceptable.

Security implications:

The east-west gateway is now internet-accessible on port 80. However:

  • It only routes traffic to services with explicit VirtualService configurations
  • Services require the correct Host header to match
  • No sensitive data is exposed without intentional configuration
  • This is a dev environment for preview URLs, not production traffic

For production environments, I'd recommend:

  • Azure Private Link for fully private cross-cluster routing
  • Cilium ClusterMesh when Azure CNI adds support (blocked by GitHub issue #5194)
  • IP whitelisting on the east-west gateway if public access is required (see the sketch below)
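For that last option, the restriction can live directly on the east-west gateway's Service, for example via loadBalancerSourceRanges (a sketch, assuming the gateway chart passes this field through to the Service; the CIDR is a placeholder for the hub cluster's egress IPs, which are not the same as its ingress IP):

service:
  type: LoadBalancer
  loadBalancerSourceRanges:
    - 203.0.113.10/32   # placeholder: the hub cluster's outbound/SNAT IP(s)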

Step 4: Configure hub routing

ServiceEntry tells the hub where to find the dev spoke:

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: shared-dev-cluster
  namespace: istio-ingress
spec:
  hosts:
    - shared-dev.internal
  location: MESH_EXTERNAL
  ports:
    - number: 80
      name: http
      protocol: HTTP
  resolution: STATIC
  endpoints:
    - address: "52.255.217.180"  # East-west gateway public IP

VirtualService routes preview traffic to the ServiceEntry:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: preview-envs-vs
  namespace: istio-ingress
spec:
  hosts:
    - "*.chrishouse.io"
  gateways:
    - main-gateway
  http:
    - match:
        - headers:
            ":authority":
              regex: ".*-pr-[0-9]+-[a-z0-9]+\\.chrishouse\\.io"
      route:
        - destination:
            host: shared-dev.internal
            port:
              number: 80

The regex matches preview URLs like pricing-api-pr-1-red.chrishouse.io.

Step 5: Update Helm template for preview VirtualServices

The application's VirtualService needs to reference both the hub gateway and the cross-network gateway:

# helm/templates/virtualservice.yaml
spec:
  hosts:
    - {{ .Values.istio.preview.hostname }}
  gateways:
    - {{ .Values.istio.gateway }}
    {{- if .Values.istio.preview.enabled }}
    - istio-system/cross-network-gateway
    {{- end }}
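The cross-network-gateway referenced here is the Istio Gateway sitting in front of the east-west gateway on the dev spoke. It isn't shown elsewhere in this post, so here's roughly what it looks like in this setup (a sketch; the selector must match whatever labels your east-west gateway pods carry):

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: cross-network-gateway
  namespace: istio-system
spec:
  selector:
    istio: eastwestgateway   # assumption: label on the east-west gateway pods
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*.chrishouse.io"   # preview requests arrive with this Host header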

Verification

The moment of truth. After all this configuration, does it actually work?

$ curl https://pricing-api-pr-1-red.chrishouse.io/health
{"status":"healthy","service":"pricing-api","timestamp":"2025-12-23T15:13:18.004Z"}

That healthy response took way too long to see, but there it is. Traffic is flowing from the internet, through the hub cluster, across the cluster boundary, and into the preview environment running on the dev spoke.

Here's the full traffic flow:

  1. DNS resolution: Browser looks up pricing-api-pr-1-red.chrishouse.io, gets 48.194.61.98 (hub Istio ingress)
  2. TLS termination: Hub's Istio gateway terminates TLS using the *.chrishouse.io wildcard certificate
  3. Pattern matching: Hub's VirtualService sees the hostname matches the preview regex pattern
  4. Cross-cluster routing: Request is forwarded to ServiceEntry shared-dev.internal
  5. External routing: ServiceEntry resolves to 52.255.217.180 (dev spoke's east-west gateway)
  6. Service routing: East-west gateway routes to the pricing-api service based on the Host header
  7. Response: The whole chain reverses, response arrives at the browser
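When any step in that chain misbehaves, it helps to test the dev-spoke leg in isolation: hit the east-west gateway directly with the preview Host header (the IP is the one from the ServiceEntry above):

# Bypass the hub entirely: does the dev spoke route the preview host correctly?
curl -H "Host: pricing-api-pr-1-red.chrishouse.io" http://52.255.217.180/health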

Phase 4: The TLS Certificate Issue

Just when I thought we were done, there was one more surprise waiting.

The Problem

Testing the preview URL with the original naming scheme:

$ curl https://pricing-api-pr-1.red.chrishouse.io/health
curl: (60) SSL certificate problem: unable to get local issuer certificate

The routing works (we verified that with HTTP), but HTTPS is failing with a certificate error?

The original URL pattern was {service}-pr-{number}.{team}.chrishouse.io, something like pricing-api-pr-1.red.chrishouse.io. That's a two-level subdomain: pricing-api-pr-1 under red under chrishouse.io.

The existing wildcard certificate is *.chrishouse.io. And here's the thing about wildcard certificates that trips people up: they only match one level of subdomain, not arbitrary depth.

  • pricing-api.chrishouse.io → matches *.chrishouse.io
  • anything.chrishouse.io → matches *.chrishouse.io
  • pricing-api-pr-1.red.chrishouse.io → does NOT match *.chrishouse.io

The wildcard only replaces the single * portion. It doesn't recursively match nested subdomains.
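You can check which names a served certificate actually covers by inspecting its SANs (this assumes OpenSSL 1.1.1+ for the -ext flag):

# Print the subjectAltName entries of the certificate the gateway serves
openssl s_client -connect pricing-api.chrishouse.io:443 \
  -servername pricing-api.chrishouse.io </dev/null 2>/dev/null \
  | openssl x509 -noout -ext subjectAltName
# For the wildcard cert here, that prints DNS:*.chrishouse.io — one level only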

Options Considered

Option A: Per-team wildcard certs (*.red.chrishouse.io)

I could create a wildcard certificate for each team's subdomain. *.red.chrishouse.io would match pricing-api-pr-1.red.chrishouse.io just fine.

Rejected. That means managing N certificates where N is the number of teams. Each new team requires provisioning a new certificate, configuring it in the gateway, and keeping track of renewals. Certificate management is already tedious; multiplying it doesn't help.

Option B: SAN certificate with all patterns

A single certificate with Subject Alternative Names (SANs) for each team pattern: *.chrishouse.io, *.red.chrishouse.io, *.blue.chrishouse.io, etc.

Rejected. Same problem: the certificate needs to be regenerated every time a new team is added. Plus there are limits on SAN entries, and it adds operational overhead.

Option C: Change URL pattern to single-level subdomain

Instead of {service}-pr-{number}.{team}.chrishouse.io, use {service}-pr-{number}-{team}.chrishouse.io. The team name becomes part of the single subdomain rather than its own level.

Accepted. It's the simplest solution and works with the existing wildcard certificate without any changes to certificate management.

The Fix

Changed URL pattern from:

{service}-pr-{number}.{team}.chrishouse.io  (two-level)

To:

{service}-pr-{number}-{team}.chrishouse.io  (single-level)

Updated files:

  1. backstage/templates/nodejs-quickstart/skeleton/.github/workflows/preview.yml
  2. backstage/templates/nodejs-quickstart/skeleton/helm/values.yaml
  3. infra/kubernetes/istio-hub/virtualservice-preview-envs.yaml
  4. pricing-api/.github/workflows/preview.yml

New regex pattern:

headers:
  ":authority":
    regex: ".*-pr-[0-9]+-[a-z0-9]+\\.chrishouse\\.io"

DNS Update

The final piece of the puzzle was DNS. I updated the wildcard DNS record *.chrishouse.io to point to the hub Istio ingress at 48.194.61.98. This means any subdomain that doesn't have a more specific A record will resolve to the hub gateway.

Importantly, existing services with explicit A records (like argohub.chrishouse.io and backstage.chrishouse.io pointing to nginx-ingress at 20.253.73.108) are unaffected. DNS resolution prefers more specific records over wildcards, so those services continue to work exactly as before.
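In zone terms, the records end up looking roughly like this (illustrative syntax; the real records live in the DNS provider's console):

*.chrishouse.io.          A    48.194.61.98     ; hub Istio ingress (wildcard, catches preview URLs)
argohub.chrishouse.io.    A    20.253.73.108    ; nginx-ingress (explicit record wins over the wildcard)
backstage.chrishouse.io.  A    20.253.73.108    ; nginx-ingress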


Phase 5: Automatic Cleanup

Preview environments are only useful if they don't accumulate. Without cleanup, you'd end up with dozens of abandoned namespaces consuming cluster resources, each one a forgotten artifact of a PR from three months ago.

When a PR is merged or closed, the preview environment should be deleted automatically. This is the "ephemeral" part of ephemeral environments.

cleanup-preview:
  if: github.event.action == 'closed'
  runs-on: ubuntu-latest
  env:
    PR_NUM: ${{ github.event.pull_request.number }}
  steps:
    - name: Checkout GitOps repository
      uses: actions/checkout@v3
      with:
        repository: crh225/ARMServicePortal
        token: ${{ secrets.GITOPS_TOKEN }}
        path: gitops

    - name: Remove preview environment manifests
      working-directory: gitops
      run: |
        rm -rf infra/quickstart-services/${SERVICE_NAME}/preview-pr-${PR_NUM}

    - name: Commit and push cleanup
      working-directory: gitops
      run: |
        git config user.name "github-actions[bot]"
        git config user.email "github-actions[bot]@users.noreply.github.com"
        git add .
        git commit -m "Cleanup preview environment for PR #${PR_NUM}"
        git push

The cleanup follows the same GitOps pattern as creation. The workflow deletes the manifests from Git, commits the change, and lets ArgoCD handle the rest. ArgoCD has prune: true in its sync policy, which means when it detects that a resource exists in the cluster but not in Git, it deletes it.
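That prune behavior is just the standard automated sync policy on whichever ArgoCD Application (or app-of-apps) watches these manifests, something like:

syncPolicy:
  automated:
    prune: true      # delete cluster resources whose manifests disappear from Git
    selfHeal: true   # optional, but commonly paired: revert manual drift back to Git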

This is one of the elegant things about GitOps: cleanup is just another commit. Namespace, pods, services, VirtualService, all removed automatically when the ArgoCD Application manifest disappears from the repository. No custom cleanup scripts, no cron jobs scanning for orphaned resources, no manual intervention.


The Final Architecture

Internet
    │
    ▼
┌─────────────────────────────────────────┐
│ Hub Cluster (aks-mgmt-hub)              │
│                                         │
│ ┌─────────────────────────────────────┐ │
│ │ Istio Gateway (48.194.61.98)        │ │
│ │ - TLS termination (*.chrishouse.io) │ │
│ │ - VirtualService regex matching     │ │
│ └──────────────┬──────────────────────┘ │
│                │                        │
│ ┌──────────────▼──────────────────────┐ │
│ │ ServiceEntry (shared-dev.internal)  │ │
│ │ → 52.255.217.180                    │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│ Dev Spoke (aks-shared-dev)              │
│                                         │
│ ┌─────────────────────────────────────┐ │
│ │ East-West Gateway (52.255.217.180)  │ │
│ │ - Public LoadBalancer               │ │
│ └──────────────┬──────────────────────┘ │
│                │                        │
│ ┌──────────────▼──────────────────────┐ │
│ │ VirtualService (pricing-api)        │ │
│ │ - Routes to service in namespace    │ │
│ └──────────────┬──────────────────────┘ │
│                │                        │
│ ┌──────────────▼──────────────────────┐ │
│ │ pricing-api Service                 │ │
│ │ Namespace: red-dev-pr-1             │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘



What We Learned

Building this feature took longer than expected, but most of the time wasn't spent on the preview workflow itself (that was straightforward). The complexity was in the cross-cluster routing, and specifically in working around Azure's networking limitations.

What Worked

  • GitOps-driven preview environments: The pattern of PR workflow → GitOps commit → ArgoCD deploy → cleanup commit is clean and reliable. Every state change is recorded in Git, which makes debugging and auditing trivial.
  • Istio multi-cluster with remote secrets: Once configured, this just works. Istio handles service discovery across clusters without any per-service configuration.
  • Single-level subdomain URLs: A simple URL pattern change avoided certificate complexity entirely.
  • Automatic cleanup: GitOps makes cleanup as reliable as deployment. If the manifest isn't in Git, the resource doesn't exist.

What Didn't Work (Initially)

  • Azure internal LoadBalancers: Not accessible across VNet peering without Private Link. This was a frustrating discovery because everything else about VNet peering works fine.
  • Two-level subdomain URLs: Wildcard certs only match one level. This is well-documented behavior, but easy to forget when designing URL schemes.

Key Decisions

| Decision              | Choice                 | Rationale                                                                |
| --------------------- | ---------------------- | ------------------------------------------------------------------------ |
| Cross-cluster routing | Istio multi-cluster    | Industry standard, mTLS, service discovery                               |
| East-west gateway     | Public LoadBalancer    | Azure internal LBs not accessible cross-VNet; cheaper than Private Link  |
| URL pattern           | Single-level subdomain | Works with existing wildcard cert                                        |
| Cleanup trigger       | PR close event         | Immediate, no TTL complexity                                             |

Cost vs Security Tradeoff

This is a personal lab environment running on my Azure subscription (and credit card). The architecture choices reflect that:

| Option                       | Monthly Cost | Complexity | Security         |
| ---------------------------- | ------------ | ---------- | ---------------- |
| Public LoadBalancer (chosen) | ~$3.65       | Low        | Dev-acceptable   |
| Azure Private Link           | ~$10.65      | Medium     | Production-ready |
| VPN/ExpressRoute             | $50+         | High       | Enterprise-grade |

For production workloads, we could use Azure Private Link. For a dev environment where the only exposed services are ephemeral preview deployments with no sensitive data, the public LoadBalancer is a reasonable tradeoff.

The east-west gateway only routes traffic to services with explicit VirtualService configurations. It's not an open proxy.

Operational Costs

  • East-west gateway public IP: ~$3.65/month (Azure)
  • Additional Istio overhead: Minimal (already running Istio on both clusters)
  • Network egress: Hub → dev spoke traffic stays within Azure

Implementation Repository

Full implementation: github.com/crh225/ARMServicePortal



Next in series: Part 3 will cover cost visibility, showing developers the real cost of their applications directly in Backstage with Azure Cost Management integration and resource tagging.

