Monitoring

You’ll configure UDS Core’s monitoring stack for production high availability: enabling multi-replica Grafana with an external PostgreSQL database, tuning Prometheus resource allocation, and configuring Prometheus storage sizing and data retention.

Prerequisites:

  • UDS CLI installed
  • Access to a Kubernetes cluster (multi-node, multi-AZ recommended)
  • An external PostgreSQL instance accessible from the cluster (for Grafana HA)

Grafana’s default embedded SQLite database does not support multiple replicas and is lost on pod restart. Connecting an external PostgreSQL database enables multi-replica HA and persists dashboard configuration across restarts.
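Under the hood, these settings map to Grafana's `[database]` configuration section. As a rough sketch of the resulting Grafana configuration (the UDS package renders the actual values; the host, user, and password below are placeholders):

```ini
[database]
type = postgres
host = postgres.example.com:5432
name = grafana
user = grafana
password = your-password
```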

  1. Enable HA Grafana with external PostgreSQL

    Set the autoscaling toggle and non-secret database settings directly in the bundle, and use variables for credentials:

    uds-bundle.yaml

    ```yaml
    packages:
      - name: core
        repository: registry.defenseunicorns.com/public/core
        ref: x.x.x-upstream
        overrides:
          grafana:
            grafana:
              values:
                # Enable HorizontalPodAutoscaler
                - path: autoscaling.enabled
                  value: true
            uds-grafana-config:
              values:
                # PostgreSQL port
                - path: postgresql.port
                  value: 5432
                # Database name
                - path: postgresql.database
                  value: "grafana"
              variables:
                # PostgreSQL hostname
                - name: GRAFANA_PG_HOST
                  path: postgresql.host
                # Database user
                - name: GRAFANA_PG_USER
                  path: postgresql.user
                # Database password
                - name: GRAFANA_PG_PASSWORD
                  path: postgresql.password
                  sensitive: true
    ```

    uds-config.yaml

    ```yaml
    variables:
      core:
        GRAFANA_PG_HOST: "postgres.example.com"
        GRAFANA_PG_USER: "grafana"
        GRAFANA_PG_PASSWORD: "your-password"
    ```

    The default HPA configuration when HA is enabled:

    | Setting | Default | Override Path |
    | --- | --- | --- |
    | Minimum replicas | 2 | `autoscaling.minReplicas` |
    | Maximum replicas | 5 | `autoscaling.maxReplicas` |
    | CPU target utilization | 70% | `autoscaling.metrics[0].resource.target.averageUtilization` |
    | Memory target utilization | 75% | `autoscaling.metrics[1].resource.target.averageUtilization` |
    | Scale-down stabilization | 300 seconds | `autoscaling.behavior.scaleDown.stabilizationWindowSeconds` |
    | Scale-down rate | 1 pod per 300 seconds | `autoscaling.behavior.scaleDown.policies[0]` |
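    If the defaults don't fit your environment, the override paths in the table above can be set in the same bundle. For example, a sketch that widens the replica range (the values shown are illustrative, not recommendations):

    ```yaml
    overrides:
      grafana:
        grafana:
          values:
            - path: autoscaling.minReplicas
              value: 3
            - path: autoscaling.maxReplicas
              value: 8
    ```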
  2. Tune Prometheus resources

    Prometheus runs as a single replica in UDS Core. For clusters with many nodes or high cardinality workloads, increase resource allocation to prevent OOM kills and slow queries. See the Prometheus storage documentation for guidance on resource needs relative to ingestion volume.

    uds-bundle.yaml

    ```yaml
    packages:
      - name: core
        repository: registry.defenseunicorns.com/public/core
        ref: x.x.x-upstream
        overrides:
          kube-prometheus-stack:
            kube-prometheus-stack:
              values:
                # Adjust resource values for your environment
                - path: prometheus.prometheusSpec.resources
                  value:
                    requests:
                      cpu: 200m
                      memory: 1Gi
                    limits:
                      cpu: 500m
                      memory: 4Gi
    ```
  3. Configure Prometheus storage and retention

    UDS Core provisions a 50Gi PVC with 10-day retention by default. Adjust both settings based on the number of scrape targets, metrics cardinality, and how long you need to keep historical data.
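    A quick way to size these settings is to estimate the ingestion rate first. As a rule of thumb, ingested samples per second is roughly the number of active series divided by the scrape interval. A minimal sketch, where both input numbers are illustrative assumptions rather than measurements:

    ```shell
    active_series=500000   # e.g. observed via the prometheus_tsdb_head_series metric
    scrape_interval=30     # seconds; check your actual configured scrape interval
    samples_per_second=$((active_series / scrape_interval))
    echo "${samples_per_second} samples/s"
    ```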

    | Setting | Default | Override Path |
    | --- | --- | --- |
    | PVC size | 50Gi | `prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage` |
    | Time-based retention | 10d | `prometheus.prometheusSpec.retention` |
    | Size-based retention | Disabled | `prometheus.prometheusSpec.retentionSize` |
    uds-bundle.yaml

    ```yaml
    packages:
      - name: core
        repository: registry.defenseunicorns.com/public/core
        ref: x.x.x-upstream
        overrides:
          kube-prometheus-stack:
            kube-prometheus-stack:
              values:
                # Increase PVC size for longer retention
                - path: prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage
                  value: "100Gi"
                # Keep data for 30 days
                - path: prometheus.prometheusSpec.retention
                  value: "30d"
                # Safety cap: drop oldest data if disk usage exceeds this limit
                - path: prometheus.prometheusSpec.retentionSize
                  value: "90GB"
    ```

    To estimate disk needs, use the upstream formula from the Prometheus storage documentation:

    needed_disk_space = retention_time_seconds × ingested_samples_per_second × bytes_per_sample

    In practice, bytes_per_sample averages 1–2 bytes after compression. Start with the defaults, then query prometheus_tsdb_storage_blocks_bytes in Grafana to observe actual usage and project growth before resizing.
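    The formula above can be sanity-checked with shell arithmetic before committing to a PVC size; all of the inputs here are illustrative assumptions:

    ```shell
    retention_seconds=$((30 * 24 * 3600))  # 30-day retention
    samples_per_second=100000              # assumed ingestion rate
    bytes_per_sample=2                     # conservative post-compression average
    needed_bytes=$((retention_seconds * samples_per_second * bytes_per_sample))
    echo "$((needed_bytes / 1024 ** 3)) GiB needed"
    ```

    With these assumptions the estimate lands well above the 100Gi example, which is why observing actual usage before resizing matters.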

  4. Create and deploy your bundle

    ```shell
    uds create <path-to-bundle-dir>
    uds deploy uds-bundle-<name>-<arch>-<version>.tar.zst
    ```

Confirm the monitoring stack is healthy:

```shell
# Check Grafana HPA status
uds zarf tools kubectl get hpa -n grafana

# Confirm multiple Grafana replicas are running
uds zarf tools kubectl get pods -n grafana -l app.kubernetes.io/name=grafana

# Check Prometheus resource allocation
uds zarf tools kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].spec.containers[0].resources}'

# Check Prometheus PVC size and capacity
uds zarf tools kubectl get pvc -n monitoring -l "operator.prometheus.io/name=kube-prometheus-stack-prometheus" -o custom-columns=NAME:.metadata.name,REQ:.spec.resources.requests.storage,CAP:.status.capacity.storage
```

Success criteria:

  • Grafana HPA shows MINPODS: 2 and current replicas >= 2
  • All Grafana pods are Running and Ready
  • Grafana UI loads and dashboards display data
  • Prometheus pod resource limits match your configured values
  • Prometheus PVC request matches your configured storage size

Problem: Grafana pods not starting after enabling HA


Symptoms: Pods in CrashLoopBackOff or Error state, logs show database connection errors.

Solution: Verify PostgreSQL connectivity and credentials:

```shell
uds zarf tools kubectl logs -n grafana -l app.kubernetes.io/name=grafana --tail=50
```

Ensure the PostgreSQL instance allows connections from the cluster’s CIDR range.

Problem: Dashboards show “No data” after migrating to HA


Symptoms: Grafana UI loads but dashboards display no data points.

Solution: Dashboard definitions are stored in ConfigMaps and will load automatically. If data sources are missing, check that the Grafana PostgreSQL database was initialized correctly — the Grafana migration should run automatically on first startup with the new database.

Problem: Prometheus pod crash-looping with storage errors


Symptoms: Pod in CrashLoopBackOff, logs show no space left on device or TSDB compaction errors.

Solution: Check Prometheus logs and PVC capacity:

```shell
# Tail recent Prometheus logs
uds zarf tools kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus --tail=50

# Compare requested vs. actual PVC capacity
uds zarf tools kubectl get pvc -n monitoring -l "operator.prometheus.io/name=kube-prometheus-stack-prometheus" -o custom-columns=NAME:.metadata.name,REQ:.spec.resources.requests.storage,CAP:.status.capacity.storage
```

Either lower the retentionSize limit to trigger faster data pruning, or expand the PVC using the Resize Prometheus PVCs runbook.
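As an interim measure while waiting on a resize, the size cap can be lowered through the same override path used earlier. A sketch (the value is illustrative and should stay below the PVC size):

```yaml
overrides:
  kube-prometheus-stack:
    kube-prometheus-stack:
      values:
        - path: prometheus.prometheusSpec.retentionSize
          value: "40GB"
```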

These guides and concepts may be useful to explore next: