Building a Hub-Spoke AKS Architecture with Istio Service Mesh

December 20, 2025 · 4 min read

After running a single-cluster Kubernetes setup for a while, I recently migrated to a hub-spoke architecture. This post covers why I made the change, how the traffic routing works, and the key decisions around where to place the Istio ingress gateways.


Why Hub-Spoke?

The single-cluster approach is simple but has limitations:

  • Blast radius - A misconfiguration or resource exhaustion affects everything
  • Scaling constraints - Platform services compete with applications for resources
  • Upgrade risk - Cluster upgrades put everything at risk simultaneously

The hub-spoke pattern separates concerns:

  • Hub cluster - Runs platform/control plane services (ArgoCD, Crossplane, Backstage)
  • Spoke cluster(s) - Runs application workloads (APIs, frontends, microservices)

The Architecture

Here's what my setup looks like:

                              +---------------------------+
                              |   Cloudflare (DNS/CDN)    |
                              +-------------+-------------+
                                            |
               +----------------------------+----------------------------+
               |                                                         |
               v                                                         v
+---------------------------+                             +---------------------------+
|  Hub Traffic              |                             |  Spoke Traffic            |
|  backstage.chrishouse.io  |                             |  portal.chrishouse.io     |
|  argocd.chrishouse.io     |                             |  portal-api.chrishouse.io |
+---------------------------+                             |  blog.chrishouse.io       |
               |                                          +---------------------------+
               v                                                         |
+--------------------------------------+                                 |
|  AKS Hub Cluster (aks-mgmt-hub)      |                                 |
|  +--------------------------------+  |                                 |
|  | Istio Ingress Gateway          |  |                                 |
|  | (Hub Services Only)            |  |                                 |
|  +--------------------------------+  |                                 |
|                                      |                                 |
|  ArgoCD | Crossplane | Cert-Manager  |                                 |
|  Backstage | Argo Rollouts           |                                 |
+--------------------------------------+                                 |
               |                                                         |
               | Manages via ArgoCD                                      |
               v                                                         v
+------------------------------------------------------------------------+
|  AKS Spoke Cluster (aks-app-spoke)                                     |
|  +------------------------------------------------------------------+  |
|  | Istio Ingress Gateway                                            |  |
|  | (Application Traffic - Direct from Cloudflare)                   |  |
|  +------------------------------------------------------------------+  |
|                                                                        |
|  portal-api (Node.js) | blog (Gatsby) | frontend (React)               |
+------------------------------------------------------------------------+

Key Design Decision: Decentralized Ingress

The critical decision was where to place the Istio ingress gateways. There are two patterns:

Centralized (Hub Ingress)

Internet -> Hub Gateway -> Routes to Spoke clusters

Decentralized (Spoke Ingress) - What I chose

Internet -> Each cluster's own Gateway

I went with decentralized ingress for one main reason: fault isolation. If the hub cluster goes down (maintenance, failed upgrade, resource issues), my applications remain accessible. The hub is a control plane, not a data plane.


Cluster Breakdown

Hub Cluster Services

Service        Purpose
ArgoCD         GitOps controller - manages deployments to all clusters
Crossplane     Infrastructure as Code - provisions cloud resources
Cert-Manager   TLS certificate automation via Let's Encrypt
Backstage      Developer portal and service catalog
Argo Rollouts  Progressive delivery controller

Spoke Cluster Platform Services

Service             Purpose
Cert-Manager        Independent TLS certificates for spoke ingress
Istio Service Mesh  Traffic management and mTLS

Spoke Cluster Workloads

Application      Description
portal-api       Node.js backend API
portal-frontend  React SPA
blog             Static site (Gatsby)

Istio Configuration

Each cluster runs its own Istio service mesh with an ingress gateway. The spoke cluster handles HTTPS termination with its own TLS certificates:

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: external-gateway
  namespace: istio-ingress
spec:
  selector:
    istio: aks-istio-ingressgateway-external
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "portal.chrishouse.io"
        - "portal-api.chrishouse.io"
        - "blog.chrishouse.io"
      tls:
        httpsRedirect: true
    - port:
        number: 443
        name: https
        protocol: HTTPS
      hosts:
        - "portal.chrishouse.io"
        - "portal-api.chrishouse.io"
        - "blog.chrishouse.io"
      tls:
        mode: SIMPLE
        credentialName: wildcard-tls

The wildcard-tls secret is created by cert-manager using a wildcard certificate for *.chrishouse.io. This means the spoke cluster is fully independent for TLS - it doesn't rely on the hub for certificate management.
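A minimal sketch of the cert-manager Certificate behind that secret (the resource name and the letsencrypt-dns ClusterIssuer are illustrative, not my exact manifest; Let's Encrypt only issues wildcard certificates via a DNS-01 challenge, so the issuer needs a DNS-01 solver):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-chrishouse-io      # illustrative name
  namespace: istio-ingress          # the secret must be readable by the ingress gateway
spec:
  secretName: wildcard-tls          # matches credentialName in the Gateway above
  issuerRef:
    name: letsencrypt-dns           # illustrative ClusterIssuer with a DNS-01 solver
    kind: ClusterIssuer
  dnsNames:
    - "*.chrishouse.io"
    - "chrishouse.io"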

Each application gets a VirtualService that routes traffic:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: portal-api-vs
  namespace: istio-ingress
spec:
  hosts:
    - "portal-api.chrishouse.io"
  gateways:
    - external-gateway
  http:
    - route:
        - destination:
            host: portal-api.portal-api.svc.cluster.local
            port:
              number: 80

Removing Redundant Ingress Resources

One issue I ran into: my Helm charts still had nginx Ingress resources defined, even though Istio handles all traffic routing. This caused ArgoCD to show applications as "Progressing" indefinitely.

Why? ArgoCD's health check for Ingress resources waits for a load balancer IP to be assigned. Since nginx-ingress wasn't assigning IPs (Istio handles traffic instead), the Ingress stayed in a pending state forever.

The fix was simple - disable the Ingress in Helm values:

# values.yaml
ingress:
  enabled: false  # Istio VirtualService handles routing

And remove the Ingress from Kustomize resources:

# kustomization.yaml
resources:
  - namespace.yaml
  - serviceaccount.yaml
  - configmap.yaml
  - deployment.yaml
  - service.yaml
  # - ingress.yaml  # Removed - using Istio

Traffic Flow Explained

  1. DNS: Cloudflare manages DNS for *.chrishouse.io, with each hostname resolving to the external IP of the cluster that serves it (the spoke gateway for app traffic, the hub gateway for platform UIs)
  2. TLS: Terminated at the Istio ingress gateway using wildcard certificates issued by cert-manager (each cluster manages its own certs)
  3. Service Mesh: Istio routes to the correct service based on VirtualService rules
  4. mTLS: All pod-to-pod traffic within the mesh is encrypted (enforcement sketched below)
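
The mTLS in step 4 comes from the Istio sidecars automatically; making it mandatory rather than best-effort takes a mesh-wide PeerAuthentication. A minimal sketch, assuming the mesh root namespace is aks-istio-system (the AKS Istio add-on default; a self-managed install would use istio-system):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: aks-istio-system   # assumed mesh root namespace
spec:
  mtls:
    mode: STRICT                # reject plaintext traffic between pods in the mesh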

For hub services (backstage.chrishouse.io):

Cloudflare -> Hub Istio Gateway -> Backstage Pod

For spoke services (portal-api.chrishouse.io):

Cloudflare -> Spoke Istio Gateway -> Portal API Pod

The hub is never in the path for spoke traffic.


ArgoCD Multi-Cluster Management

ArgoCD on the hub manages applications across both clusters. Each Application specifies its destination:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: portal-api
  namespace: argocd
spec:
  destination:
    name: aks-app-spoke  # Target cluster
    namespace: portal-api
  source:
    repoURL: https://github.com/crh225/ARMServicePortal.git
    path: infra/kubernetes/portal-api
    targetRevision: main

The spoke cluster is registered as an additional destination in ArgoCD (the hub itself is the default in-cluster target), allowing centralized management from the hub while keeping workloads distributed.
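
Registering the spoke can be done with argocd cluster add, or declaratively with a Secret labeled as a cluster. A sketch with placeholder server URL and credentials, not my actual values:

apiVersion: v1
kind: Secret
metadata:
  name: aks-app-spoke
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster   # tells ArgoCD this Secret defines a cluster
type: Opaque
stringData:
  name: aks-app-spoke                         # destination name used by Applications
  server: https://<spoke-api-server>:443      # placeholder: spoke API server URL
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": {
        "caData": "<base64-encoded-ca-cert>"
      }
    }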


Pros and Cons

Advantages

  1. Fault isolation - Hub issues don't affect running applications
  2. Independent scaling - Clusters scale based on their workload type
  3. Cleaner upgrades - Upgrade hub without touching production apps
  4. Security boundaries - Platform credentials isolated from app workloads

Tradeoffs

  1. Complexity - Two clusters to manage instead of one
  2. Cost - Additional control plane costs (though node pools can be sized appropriately)
  3. Networking - Cross-cluster communication requires additional configuration
  4. Observability - Metrics and logs spread across clusters

When to Use This Pattern

Good fit:

  • Multiple teams deploying to Kubernetes
  • High availability requirements for applications
  • Frequent platform upgrades
  • Compliance requirements for separation of concerns

Overkill for:

  • Single small application
  • Development/testing environments
  • Cost-sensitive projects with low traffic

The hub-spoke pattern provides a solid foundation for scaling the platform as needs grow.

Cost was also a factor in choosing this pattern. Because the spoke serves all application traffic on its own, I can keep my apps and blog available while shutting the hub down whenever it isn't needed for active development.

A key improvement was making the spoke cluster fully independent with its own cert-manager and TLS certificates. This means the spoke can serve HTTPS traffic even when the hub is completely offline - true fault isolation for production workloads.

