Monitoring
What you’ll accomplish
You’ll configure UDS Core’s monitoring stack for production high availability: enabling multi-replica Grafana with an external PostgreSQL database and tuning Prometheus resource allocation.
Prerequisites
- UDS CLI installed
- Access to a Kubernetes cluster (multi-node, multi-AZ recommended)
- An external PostgreSQL instance accessible from the cluster (for Grafana HA)
Before you begin
Grafana’s default embedded SQLite database does not support multiple replicas and is lost on pod restart. Connecting an external PostgreSQL database enables multi-replica HA and persists dashboard configuration across restarts.
1. Enable HA Grafana with external PostgreSQL
Set the autoscaling toggle and non-secret database settings directly in the bundle, and use variables for credentials:
uds-bundle.yaml

```yaml
packages:
  - name: core
    repository: registry.defenseunicorns.com/public/core
    ref: x.x.x-upstream
    overrides:
      grafana:
        grafana:
          values:
            # Enable HorizontalPodAutoscaler
            - path: autoscaling.enabled
              value: true
        uds-grafana-config:
          values:
            # PostgreSQL port
            - path: postgresql.port
              value: 5432
            # Database name
            - path: postgresql.database
              value: "grafana"
          variables:
            # PostgreSQL hostname
            - name: GRAFANA_PG_HOST
              path: postgresql.host
            # Database user
            - name: GRAFANA_PG_USER
              path: postgresql.user
            # Database password
            - name: GRAFANA_PG_PASSWORD
              path: postgresql.password
              sensitive: true
```

uds-config.yaml

```yaml
variables:
  core:
    GRAFANA_PG_HOST: "postgres.example.com"
    GRAFANA_PG_USER: "grafana"
    GRAFANA_PG_PASSWORD: "your-password"
```

The default HPA configuration when HA is enabled:
| Setting | Default | Override Path |
| --- | --- | --- |
| Minimum replicas | 2 | `autoscaling.minReplicas` |
| Maximum replicas | 5 | `autoscaling.maxReplicas` |
| CPU target utilization | 70% | `autoscaling.metrics[0].resource.target.averageUtilization` |
| Memory target utilization | 75% | `autoscaling.metrics[1].resource.target.averageUtilization` |
| Scale-down stabilization | 300 seconds | `autoscaling.behavior.scaleDown.stabilizationWindowSeconds` |
| Scale-down rate | 1 pod per 300 seconds | `autoscaling.behavior.scaleDown.policies[0]` |
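To see how these defaults interact, here is a sketch of the core Kubernetes HPA scaling formula (simplified: the real controller also applies a tolerance band and the scale-down behavior policies above), using the default 70% CPU target and 2–5 replica bounds:

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 2,
                     max_replicas: int = 5) -> int:
    """Kubernetes HPA core formula:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# With the defaults above (70% CPU target, 2-5 replicas):
print(desired_replicas(2, 105.0, 70.0))  # sustained 105% avg CPU -> 3
print(desired_replicas(3, 30.0, 70.0))   # low load -> back to minReplicas, 2
print(desired_replicas(2, 400.0, 70.0))  # spike is capped at maxReplicas, 5
```

The 300-second stabilization window means scale-down decisions use the highest recommendation from the last five minutes, so brief dips in load do not immediately remove replicas.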
2. Tune Prometheus resources
Prometheus runs as a single replica in UDS Core. For clusters with many nodes or high cardinality workloads, increase resource allocation to prevent OOM kills and slow queries. See the Prometheus storage documentation for guidance on resource needs relative to ingestion volume.
uds-bundle.yaml

```yaml
packages:
  - name: core
    repository: registry.defenseunicorns.com/public/core
    ref: x.x.x-upstream
    overrides:
      kube-prometheus-stack:
        kube-prometheus-stack:
          values:
            # Adjust resource values for your environment
            - path: prometheus.prometheusSpec.resources
              value:
                requests:
                  cpu: 200m
                  memory: 1Gi
                limits:
                  cpu: 500m
                  memory: 4Gi
```
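As a very rough starting point for choosing a memory limit, you can estimate head-block memory from the number of active series. The per-series cost below is an assumption (a few KiB per series is a commonly cited ballpark); actual usage also depends on churn, query load, and scrape interval:

```python
def prometheus_memory_estimate_gib(active_series: int,
                                   bytes_per_series: int = 3072) -> float:
    """Rough head-memory estimate: active series times an assumed
    per-series cost (~3 KiB here). A ballpark only, not a guarantee."""
    return active_series * bytes_per_series / 2**30

print(round(prometheus_memory_estimate_gib(1_000_000), 1))  # ~2.9 GiB for 1M series
```

By this heuristic, the 4Gi limit in the example comfortably covers around a million active series; check the `prometheus_tsdb_head_series` metric on your own cluster before settling on a value.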
3. Create and deploy your bundle
```shell
uds create <path-to-bundle-dir>
uds deploy uds-bundle-<name>-<arch>-<version>.tar.zst
```
Verification
Confirm the monitoring stack is healthy:
```shell
# Check Grafana HPA status
uds zarf tools kubectl get hpa -n grafana

# Confirm multiple Grafana replicas are running
uds zarf tools kubectl get pods -n grafana -l app.kubernetes.io/name=grafana

# Check Prometheus resource allocation
uds zarf tools kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].spec.containers[0].resources}'
```

Success criteria:
- Grafana HPA shows `MINPODS: 2` and current replicas >= 2
- All Grafana pods are `Running` and `Ready`
- Grafana UI loads and dashboards display data
- Prometheus pod resource limits match your configured values
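The last check can be scripted. A minimal sketch that compares the JSON emitted by the `jsonpath` query above against the limits configured in the bundle (the helper name and expected values are illustrative):

```python
import json

# Limits configured in the example bundle override above
EXPECTED_LIMITS = {"cpu": "500m", "memory": "4Gi"}

def limits_match(resources_json: str, expected: dict = EXPECTED_LIMITS) -> bool:
    """Compare the 'limits' section of the kubectl jsonpath output
    against the values configured in the bundle."""
    resources = json.loads(resources_json)
    return resources.get("limits") == expected

# Example output from the verification command above
sample = '{"limits":{"cpu":"500m","memory":"4Gi"},"requests":{"cpu":"200m","memory":"1Gi"}}'
print(limits_match(sample))  # True
```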
Troubleshooting
Problem: Grafana pods not starting after enabling HA
Symptoms: Pods in CrashLoopBackOff or Error state, logs show database connection errors.
Solution: Verify PostgreSQL connectivity and credentials:
```shell
uds zarf tools kubectl logs -n grafana -l app.kubernetes.io/name=grafana --tail=50
```

Ensure the PostgreSQL instance allows connections from the cluster’s CIDR range.
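To separate network problems from credential problems, a quick TCP reachability probe helps: if the port is unreachable the issue is network policy, firewall, or DNS; if it connects but Grafana still fails, suspect credentials or database permissions. A minimal sketch (the hostname is the placeholder from the example config, not a real endpoint):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """TCP reachability check for the Grafana database.
    Validates only the network path, not credentials."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder host/port from the bundle variables above
print(can_reach("postgres.example.com", 5432))
```

Run this from a pod inside the cluster (not your workstation) so it exercises the same network path Grafana uses.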
Problem: Dashboards show “No data” after migrating to HA
Symptoms: Grafana UI loads but dashboards display no data points.
Solution: Dashboard definitions are stored in ConfigMaps and will load automatically. If data sources are missing, check that the Grafana PostgreSQL database was initialized correctly — the Grafana migration should run automatically on first startup with the new database.
Related Documentation
- Grafana: High Availability Setup — configuring Grafana for HA with an external database
- Grafana: Configure a PostgreSQL Database — database backend options for Grafana
- Prometheus: Storage — TSDB storage architecture and operational guidance
- Prometheus: Remote Storage Integrations — Thanos, Cortex, VictoriaMetrics, and other remote storage options
Next steps
These guides and concepts may be useful to explore next: