High Availability

Production deployments of UDS Core need redundancy, autoscaling, and fault tolerance to meet uptime requirements. This section provides per-component guides for configuring high availability across the platform stack.

These guides assume you already have UDS Core deployed and are familiar with UDS bundle overrides. Where relevant, guides also cover how to adjust resource allocations for production workloads. For background on each component, see the Core Features concepts.

HA capabilities at a glance

Component	HA Mechanism	External Dependency	Default Behavior
Keycloak	HPA (2–5 replicas)	PostgreSQL	Single replica (devMode)
Grafana	HPA (2–5 replicas)	PostgreSQL	Single replica
Loki	Multi-replica (SimpleScalable)	S3-compatible storage	3 replicas per tier
Vector	DaemonSet	None	One pod per node
Prometheus	Resource tuning	External TSDB (for multi-replica)	Single replica
Authservice	HPA (1–3 replicas)	Redis / Valkey	Single replica
Falcosidekick	Static replicas	None	2 replicas
Istio (istiod)	HPA + pod anti-affinity	None	HPA (1–5 replicas)
Istio (gateways)	HPA	None	HPA (1–5 replicas)
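
When planning an HA rollout, it can help to see which external services must exist before a component can scale past one replica. The following is a minimal sketch, not part of UDS Core: it transcribes the table rows above into a hypothetical `Component` record and groups components by external dependency. The names `Component`, `COMPONENTS`, and `external_dependencies` are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Component:
    """One row of the HA table (hypothetical structure, for illustration only)."""
    name: str
    ha_mechanism: str
    external_dependency: Optional[str]  # None when the table lists "None"
    default_behavior: str


# Rows transcribed from the table above.
COMPONENTS: List[Component] = [
    Component("Keycloak", "HPA (2–5 replicas)", "PostgreSQL", "Single replica (devMode)"),
    Component("Grafana", "HPA (2–5 replicas)", "PostgreSQL", "Single replica"),
    Component("Loki", "Multi-replica (SimpleScalable)", "S3-compatible storage", "3 replicas per tier"),
    Component("Vector", "DaemonSet", None, "One pod per node"),
    Component("Prometheus", "Resource tuning", "External TSDB (for multi-replica)", "Single replica"),
    Component("Authservice", "HPA (1–3 replicas)", "Redis / Valkey", "Single replica"),
    Component("Falcosidekick", "Static replicas", None, "2 replicas"),
    Component("Istio (istiod)", "HPA + pod anti-affinity", None, "HPA (1–5 replicas)"),
    Component("Istio (gateways)", "HPA", None, "HPA (1–5 replicas)"),
]


def external_dependencies(components: List[Component]) -> Dict[str, List[str]]:
    """Group component names by the external dependency they require."""
    needed: Dict[str, List[str]] = {}
    for component in components:
        if component.external_dependency is not None:
            needed.setdefault(component.external_dependency, []).append(component.name)
    return needed


if __name__ == "__main__":
    # Print a checklist of external services to provision before enabling HA.
    for dependency, names in external_dependencies(COMPONENTS).items():
        print(f"{dependency}: {', '.join(names)}")
```

Read this way, the table shows that a single external PostgreSQL service is a prerequisite for both Keycloak and Grafana HA, while Vector, Falcosidekick, and Istio need no external services.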

Related Documentation

These external resources provide foundational Kubernetes and cloud-provider-specific HA guidance that complements the UDS Core guides below:

Kubernetes: Running in multiple zones — distributing workloads across failure domains
Kubernetes: Disruptions and PodDisruptionBudgets — protecting availability during voluntary disruptions
Kubernetes: Horizontal Pod Autoscaling — scaling workloads based on resource utilization
EKS Best Practices: Reliability — AWS-specific resilience patterns
AKS Best Practices: Reliability — Azure-specific resilience patterns
GKE Best Practices: Scalability — GCP-specific scaling and HA guidance

Component guides

Keycloak: External PostgreSQL, HPA autoscaling, and waypoint proxy scaling.

Logging (Loki & Vector): Loki replica tuning with external S3 storage and Vector production resource configuration.

Monitoring (Grafana & Prometheus): HA Grafana with external PostgreSQL and Prometheus resource tuning.

Authservice: External Redis session store and replica scaling for SSO proxy resilience.

Runtime Security: Falcosidekick replica tuning for resilient alert delivery.

Service Mesh: Istio control plane and ingress gateway scaling, resource tuning, and anti-affinity verification.
