Building a Hub-Spoke AKS Architecture with Istio Service Mesh

December 20, 2025 · 4 min read

After running a single-cluster Kubernetes setup for a while, I recently migrated to a hub-spoke architecture. This post covers why I made the change, how the traffic routing works, and the key decisions around where to place the Istio ingress gateways.


Why Hub-Spoke?

The single-cluster approach is simple but has limitations:

  • Blast radius - A misconfiguration or resource exhaustion affects everything
  • Scaling constraints - Platform services compete with applications for resources
  • Upgrade risk - Cluster upgrades put everything at risk simultaneously

The hub-spoke pattern separates concerns:

  • Hub cluster - Runs platform/control plane services (ArgoCD, Crossplane, Backstage)
  • Spoke cluster(s) - Runs application workloads (APIs, frontends, microservices)

The Architecture

Here's what my setup looks like:

                              +---------------------------+
                              |   Cloudflare (DNS/CDN)    |
                              +-------------+-------------+
                                            |
               +----------------------------+----------------------------+
               |                                                         |
               v                                                         v
+---------------------------+                             +---------------------------+
|  Hub Traffic              |                             |  Spoke Traffic            |
|  backstage.chrishouse.io  |                             |  portal.chrishouse.io     |
|  argocd.chrishouse.io     |                             |  portal-api.chrishouse.io |
+---------------------------+                             |  blog.chrishouse.io       |
               |                                          +---------------------------+
               v                                                         |
+--------------------------------------+                                 |
|  AKS Hub Cluster (aks-mgmt-hub)      |                                 |
|  +--------------------------------+  |                                 |
|  | Istio Ingress Gateway          |  |                                 |
|  | (Hub Services Only)            |  |                                 |
|  +--------------------------------+  |                                 |
|                                      |                                 |
|  ArgoCD | Crossplane | Cert-Manager  |                                 |
|  Backstage | Argo Rollouts           |                                 |
+--------------------------------------+                                 |
               |                                                         |
               | Manages via ArgoCD                                      |
               v                                                         v
+------------------------------------------------------------------------+
|  AKS Spoke Cluster (aks-app-spoke)                                     |
|  +------------------------------------------------------------------+  |
|  | Istio Ingress Gateway                                            |  |
|  | (Application Traffic - Direct from Cloudflare)                   |  |
|  +------------------------------------------------------------------+  |
|                                                                        |
|  portal-api (Node.js) | blog (Gatsby) | frontend (React)               |
+------------------------------------------------------------------------+

Key Design Decision: Decentralized Ingress

The critical decision was where to place the Istio ingress gateways. There are two patterns:

Centralized (Hub Ingress)

Internet -> Hub Gateway -> Routes to Spoke clusters

Decentralized (Spoke Ingress) - What I chose

Internet -> Each cluster's own Gateway

I went with decentralized ingress for one main reason: fault isolation. If the hub cluster goes down (maintenance, failed upgrade, resource issues), my applications remain accessible. The hub is a control plane, not a data plane.


Cluster Breakdown

Hub Cluster Services

Service        Purpose
ArgoCD         GitOps controller - manages deployments to all clusters
Crossplane     Infrastructure as Code - provisions cloud resources
Cert-Manager   TLS certificate automation via Let's Encrypt
Backstage      Developer portal and service catalog
Argo Rollouts  Progressive delivery controller

Spoke Cluster Platform Services

Service             Purpose
Cert-Manager        Independent TLS certificates for spoke ingress
Istio Service Mesh  Traffic management and mTLS

Spoke Cluster Workloads

Application      Description
portal-api       Node.js backend API
portal-frontend  React SPA
blog             Static site (Gatsby)

Istio Configuration

Each cluster runs its own Istio service mesh with an ingress gateway. The spoke cluster handles HTTPS termination with its own TLS certificates:

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: external-gateway
  namespace: istio-ingress
spec:
  selector:
    istio: aks-istio-ingressgateway-external
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "portal.chrishouse.io"
        - "portal-api.chrishouse.io"
        - "blog.chrishouse.io"
      tls:
        httpsRedirect: true
    - port:
        number: 443
        name: https
        protocol: HTTPS
      hosts:
        - "portal.chrishouse.io"
        - "portal-api.chrishouse.io"
        - "blog.chrishouse.io"
      tls:
        mode: SIMPLE
        credentialName: wildcard-tls

The wildcard-tls secret is created by cert-manager using a wildcard certificate for *.chrishouse.io. This means the spoke cluster is fully independent for TLS - it doesn't rely on the hub for certificate management.
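A minimal sketch of the cert-manager Certificate behind that secret (the resource name and the letsencrypt-dns ClusterIssuer are illustrative, not my exact manifest; Let's Encrypt only issues wildcard certificates via a DNS-01 challenge, so the issuer needs a DNS-01 solver):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-chrishouse-io      # illustrative name
  namespace: istio-ingress          # the secret must be readable by the ingress gateway
spec:
  secretName: wildcard-tls          # matches credentialName in the Gateway above
  issuerRef:
    name: letsencrypt-dns           # illustrative ClusterIssuer with a DNS-01 solver
    kind: ClusterIssuer
  dnsNames:
    - "*.chrishouse.io"
    - "chrishouse.io"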

Each application gets a VirtualService that routes traffic:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: portal-api-vs
  namespace: istio-ingress
spec:
  hosts:
    - "portal-api.chrishouse.io"
  gateways:
    - external-gateway
  http:
    - route:
        - destination:
            host: portal-api.portal-api.svc.cluster.local
            port:
              number: 80

Removing Redundant Ingress Resources

One issue I ran into: my Helm charts still had nginx Ingress resources defined, even though Istio handles all traffic routing. This caused ArgoCD to show applications as "Progressing" indefinitely.

Why? ArgoCD's health check for Ingress resources waits for a load balancer IP to be assigned. Since nginx-ingress wasn't assigning IPs (Istio handles traffic instead), the Ingress stayed in a pending state forever.

The fix was simple - disable the Ingress in Helm values:

# values.yaml
ingress:
  enabled: false  # Istio VirtualService handles routing

And remove the Ingress from Kustomize resources:

# kustomization.yaml
resources:
  - namespace.yaml
  - serviceaccount.yaml
  - configmap.yaml
  - deployment.yaml
  - service.yaml
  # - ingress.yaml  # Removed - using Istio

Traffic Flow Explained

  1. DNS: Cloudflare manages DNS for *.chrishouse.io, with each hostname resolving to the external IP of the cluster that serves it (the spoke gateway for app traffic, the hub gateway for platform UIs)
  2. TLS: Terminated at the Istio ingress gateway using wildcard certificates issued by cert-manager (each cluster manages its own certs)
  3. Service Mesh: Istio routes to the correct service based on VirtualService rules
  4. mTLS: All pod-to-pod traffic within the mesh is encrypted (enforcement sketched below)
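
The mTLS in step 4 comes from the Istio sidecars automatically; making it mandatory rather than best-effort takes a mesh-wide PeerAuthentication. A minimal sketch, assuming the mesh root namespace is aks-istio-system (the AKS Istio add-on default; a self-managed install would use istio-system):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: aks-istio-system   # assumed mesh root namespace
spec:
  mtls:
    mode: STRICT                # reject plaintext traffic between pods in the mesh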

For hub services (backstage.chrishouse.io):

Cloudflare -> Hub Istio Gateway -> Backstage Pod

For spoke services (portal-api.chrishouse.io):

Cloudflare -> Spoke Istio Gateway -> Portal API Pod

The hub is never in the path for spoke traffic.


ArgoCD Multi-Cluster Management

ArgoCD on the hub manages applications across both clusters. Each Application specifies its destination:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: portal-api
  namespace: argocd
spec:
  destination:
    name: aks-app-spoke  # Target cluster
    namespace: portal-api
  source:
    repoURL: https://github.com/crh225/ARMServicePortal.git
    path: infra/kubernetes/portal-api
    targetRevision: main

The spoke cluster is registered as an additional destination in ArgoCD (the hub itself is the default in-cluster target), allowing centralized management from the hub while keeping workloads distributed.
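
Registering the spoke can be done with argocd cluster add, or declaratively with a Secret labeled as a cluster. A sketch with placeholder server URL and credentials, not my actual values:

apiVersion: v1
kind: Secret
metadata:
  name: aks-app-spoke
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster   # tells ArgoCD this Secret defines a cluster
type: Opaque
stringData:
  name: aks-app-spoke                         # destination name used by Applications
  server: https://<spoke-api-server>:443      # placeholder: spoke API server URL
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": {
        "caData": "<base64-encoded-ca-cert>"
      }
    }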


Pros and Cons

Advantages

  1. Fault isolation - Hub issues don't affect running applications
  2. Independent scaling - Clusters scale based on their workload type
  3. Cleaner upgrades - Upgrade hub without touching production apps
  4. Security boundaries - Platform credentials isolated from app workloads

Tradeoffs

  1. Complexity - Two clusters to manage instead of one
  2. Cost - Additional control plane costs (though node pools can be sized appropriately)
  3. Networking - Cross-cluster communication requires additional configuration
  4. Observability - Metrics and logs spread across clusters

When to Use This Pattern

Good fit:

  • Multiple teams deploying to Kubernetes
  • High availability requirements for applications
  • Frequent platform upgrades
  • Compliance requirements for separation of concerns

Overkill for:

  • Single small application
  • Development/testing environments
  • Cost-sensitive projects with low traffic

The hub-spoke pattern provides a solid foundation for scaling the platform as needs grow.

Cost was also a factor in choosing this pattern. Because the spoke serves all application traffic on its own, I can keep my apps and blog available while shutting the hub down whenever it isn't needed for active development.

A key improvement was making the spoke cluster fully independent with its own cert-manager and TLS certificates. This means the spoke can serve HTTPS traffic even when the hub is completely offline - true fault isolation for production workloads.

