Monitoring
What you’ll accomplish
You’ll configure UDS Core’s monitoring stack for production high availability: enabling multi-replica Grafana with an external PostgreSQL database, tuning Prometheus resource allocation, and configuring Prometheus storage sizing and data retention.
Prerequisites
- UDS CLI installed
- Access to a Kubernetes cluster (multi-node, multi-AZ recommended)
- An external PostgreSQL instance accessible from the cluster (for Grafana HA)
Before you begin
Grafana’s default embedded SQLite database does not support multiple replicas and is lost on pod restart. Connecting an external PostgreSQL database enables multi-replica HA and persists dashboard configuration across restarts.
Step 1: Enable HA Grafana with external PostgreSQL
Set the autoscaling toggle and non-secret database settings directly in the bundle, and use variables for credentials:
uds-bundle.yaml

```yaml
packages:
  - name: core
    repository: registry.defenseunicorns.com/public/core
    ref: x.x.x-upstream
    overrides:
      grafana:
        grafana:
          values:
            # Enable HorizontalPodAutoscaler
            - path: autoscaling.enabled
              value: true
        uds-grafana-config:
          values:
            # PostgreSQL port
            - path: postgresql.port
              value: 5432
            # Database name
            - path: postgresql.database
              value: "grafana"
          variables:
            # PostgreSQL hostname
            - name: GRAFANA_PG_HOST
              path: postgresql.host
            # Database user
            - name: GRAFANA_PG_USER
              path: postgresql.user
            # Database password
            - name: GRAFANA_PG_PASSWORD
              path: postgresql.password
              sensitive: true
```

uds-config.yaml

```yaml
variables:
  core:
    GRAFANA_PG_HOST: "postgres.example.com"
    GRAFANA_PG_USER: "grafana"
    GRAFANA_PG_PASSWORD: "your-password"
```

The default HPA configuration when HA is enabled:
| Setting | Default | Override Path |
| --- | --- | --- |
| Minimum replicas | 2 | `autoscaling.minReplicas` |
| Maximum replicas | 5 | `autoscaling.maxReplicas` |
| CPU target utilization | 70% | `autoscaling.metrics[0].resource.target.averageUtilization` |
| Memory target utilization | 75% | `autoscaling.metrics[1].resource.target.averageUtilization` |
| Scale-down stabilization | 300 seconds | `autoscaling.behavior.scaleDown.stabilizationWindowSeconds` |
| Scale-down rate | 1 pod per 300 seconds | `autoscaling.behavior.scaleDown.policies[0]` |
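These defaults plug into the standard Kubernetes HPA algorithm: `desired = ceil(current × currentMetric / targetMetric)`, clamped to the min/max replica bounds. A quick Python sketch (illustrative only; the function name and example utilizations are not from UDS Core) of how the CPU metric drives Grafana's replica count under the defaults above:

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int = 2, max_replicas: int = 5) -> int:
    """Standard HPA scaling rule, clamped to the UDS Core Grafana defaults."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

# At 2 replicas averaging 90% CPU against the 70% target, the HPA scales to 3.
print(desired_replicas(2, 90, 70))  # -> 3
```

The 300-second stabilization window means scale-down decisions use the highest desired count seen over the last 5 minutes, which prevents replica flapping during bursty dashboard traffic.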
Step 2: Tune Prometheus resources
Prometheus runs as a single replica in UDS Core. For clusters with many nodes or high cardinality workloads, increase resource allocation to prevent OOM kills and slow queries. See the Prometheus storage documentation for guidance on resource needs relative to ingestion volume.
uds-bundle.yaml

```yaml
packages:
  - name: core
    repository: registry.defenseunicorns.com/public/core
    ref: x.x.x-upstream
    overrides:
      kube-prometheus-stack:
        kube-prometheus-stack:
          values:
            # Adjust resource values for your environment
            - path: prometheus.prometheusSpec.resources
              value:
                requests:
                  cpu: 200m
                  memory: 1Gi
                limits:
                  cpu: 500m
                  memory: 4Gi
```
Step 3: Configure Prometheus storage and retention
UDS Core provisions a 50Gi PVC with 10-day retention by default. Adjust both settings based on the number of scrape targets, metrics cardinality, and how long you need to keep historical data.
| Setting | Default | Override Path |
| --- | --- | --- |
| PVC size | 50Gi | `prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage` |
| Time-based retention | 10d | `prometheus.prometheusSpec.retention` |
| Size-based retention | Disabled | `prometheus.prometheusSpec.retentionSize` |

uds-bundle.yaml

```yaml
packages:
  - name: core
    repository: registry.defenseunicorns.com/public/core
    ref: x.x.x-upstream
    overrides:
      kube-prometheus-stack:
        kube-prometheus-stack:
          values:
            # Increase PVC size for longer retention
            - path: prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage
              value: "100Gi"
            # Keep data for 30 days
            - path: prometheus.prometheusSpec.retention
              value: "30d"
            # Safety cap: drop oldest data if disk usage exceeds this limit
            - path: prometheus.prometheusSpec.retentionSize
              value: "90GB"
```

To estimate disk needs, use the upstream formula from the Prometheus storage documentation:

```
needed_disk_space = retention_time_seconds × ingested_samples_per_second × bytes_per_sample
```

In practice, `bytes_per_sample` averages 1–2 bytes after compression. Start with the defaults, then query `prometheus_tsdb_storage_blocks_bytes` in Grafana to observe actual usage and project growth before resizing.
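The formula above can be sketched with illustrative numbers. The ingestion rate here is an assumption, not a UDS Core figure; measure yours with the PromQL query `rate(prometheus_tsdb_head_samples_appended_total[1h])`:

```python
def needed_disk_gib(retention_days: float, samples_per_second: float,
                    bytes_per_sample: float = 2.0) -> float:
    """Upstream Prometheus sizing formula, converted to GiB.

    bytes_per_sample defaults to the conservative end of the 1-2 byte
    post-compression range noted above.
    """
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample / 1024**3

# 30d retention at an assumed 10k samples/s:
print(round(needed_disk_gib(30, 10_000)))  # -> 48
```

With these assumed figures, the example's 90GB `retentionSize` cap and 100Gi PVC leave comfortable headroom; rerun the estimate with your measured ingestion rate before committing to a size.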
Step 4: Create and deploy your bundle
```shell
uds create <path-to-bundle-dir>
uds deploy uds-bundle-<name>-<arch>-<version>.tar.zst
```
Verification
Confirm the monitoring stack is healthy:
```shell
# Check Grafana HPA status
uds zarf tools kubectl get hpa -n grafana

# Confirm multiple Grafana replicas are running
uds zarf tools kubectl get pods -n grafana -l app.kubernetes.io/name=grafana

# Check Prometheus resource allocation
uds zarf tools kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].spec.containers[0].resources}'

# Check Prometheus PVC size and capacity
uds zarf tools kubectl get pvc -n monitoring -l "operator.prometheus.io/name=kube-prometheus-stack-prometheus" -o custom-columns=NAME:.metadata.name,REQ:.spec.resources.requests.storage,CAP:.status.capacity.storage
```

Success criteria:
- Grafana HPA shows `MINPODS: 2` and current replicas >= 2
- All Grafana pods are `Running` and `Ready`
- Grafana UI loads and dashboards display data
- Prometheus pod resource limits match your configured values
- Prometheus PVC request matches your configured storage size
Troubleshooting
Problem: Grafana pods not starting after enabling HA
Symptoms: Pods in `CrashLoopBackOff` or `Error` state, logs show database connection errors.
Solution: Verify PostgreSQL connectivity and credentials:
```shell
uds zarf tools kubectl logs -n grafana -l app.kubernetes.io/name=grafana --tail=50
```

Ensure the PostgreSQL instance allows connections from the cluster’s CIDR range.
Problem: Dashboards show “No data” after migrating to HA
Symptoms: Grafana UI loads but dashboards display no data points.
Solution: Dashboard definitions are stored in ConfigMaps and will load automatically. If data sources are missing, check that the Grafana PostgreSQL database was initialized correctly — the Grafana migration should run automatically on first startup with the new database.
Problem: Prometheus pod crash-looping with storage errors
Symptoms: Pod in `CrashLoopBackOff`, logs show `no space left on device` or TSDB compaction errors.
Solution: Check Prometheus logs and PVC capacity:
```shell
uds zarf tools kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus --tail=50

uds zarf tools kubectl get pvc -n monitoring -l "operator.prometheus.io/name=kube-prometheus-stack-prometheus" -o custom-columns=NAME:.metadata.name,REQ:.spec.resources.requests.storage,CAP:.status.capacity.storage
```

Either lower the `retentionSize` limit to trigger faster data pruning, or expand the PVC using the Resize Prometheus PVCs runbook.
Related Documentation
- Grafana: High Availability Setup — configuring Grafana for HA with an external database
- Grafana: Configure a PostgreSQL Database — database backend options for Grafana
- Prometheus: Storage — TSDB storage architecture and operational guidance
- Prometheus: Remote Storage Integrations — Thanos, Cortex, VictoriaMetrics, and other remote storage options
- Resize Prometheus PVCs — runbook for expanding Prometheus storage on a running cluster
Next steps
These guides and concepts may be useful to explore next: