
Monitoring

You’ll configure UDS Core’s monitoring stack for production high availability: enabling multi-replica Grafana with an external PostgreSQL database and tuning Prometheus resource allocation.

  • UDS CLI installed
  • Access to a Kubernetes cluster (multi-node, multi-AZ recommended)
  • An external PostgreSQL instance accessible from the cluster (for Grafana HA)

Grafana’s default embedded SQLite database does not support multiple replicas and is lost on pod restart. Connecting an external PostgreSQL database enables multi-replica HA and persists dashboard configuration across restarts.

  1. Enable HA Grafana with external PostgreSQL

    Set the autoscaling toggle and non-secret database settings directly in the bundle, and use variables for credentials:

    uds-bundle.yaml
    packages:
      - name: core
        repository: registry.defenseunicorns.com/public/core
        ref: x.x.x-upstream
        overrides:
          grafana:
            grafana:
              values:
                # Enable HorizontalPodAutoscaler
                - path: autoscaling.enabled
                  value: true
            uds-grafana-config:
              values:
                # PostgreSQL port
                - path: postgresql.port
                  value: 5432
                # Database name
                - path: postgresql.database
                  value: "grafana"
              variables:
                # PostgreSQL hostname
                - name: GRAFANA_PG_HOST
                  path: postgresql.host
                # Database user
                - name: GRAFANA_PG_USER
                  path: postgresql.user
                # Database password
                - name: GRAFANA_PG_PASSWORD
                  path: postgresql.password
                  sensitive: true

    uds-config.yaml
    variables:
      core:
        GRAFANA_PG_HOST: "postgres.example.com"
        GRAFANA_PG_USER: "grafana"
        GRAFANA_PG_PASSWORD: "your-password"

    The default HPA configuration when HA is enabled:

    | Setting | Default | Override Path |
    | --- | --- | --- |
    | Minimum replicas | 2 | autoscaling.minReplicas |
    | Maximum replicas | 5 | autoscaling.maxReplicas |
    | CPU target utilization | 70% | autoscaling.metrics[0].resource.target.averageUtilization |
    | Memory target utilization | 75% | autoscaling.metrics[1].resource.target.averageUtilization |
    | Scale-down stabilization | 300 seconds | autoscaling.behavior.scaleDown.stabilizationWindowSeconds |
    | Scale-down rate | 1 pod per 300 seconds | autoscaling.behavior.scaleDown.policies[0] |
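    Any of these defaults can be changed through the same bundle override mechanism shown earlier. For example, a sketch that widens the replica range (the values here are illustrative, not recommendations):

    ```yaml
    # Fragment of the grafana override section in uds-bundle.yaml
    overrides:
      grafana:
        grafana:
          values:
            - path: autoscaling.minReplicas
              value: 3   # illustrative: raise the floor for larger clusters
            - path: autoscaling.maxReplicas
              value: 8   # illustrative: allow more headroom under load
    ```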
  2. Tune Prometheus resources

    Prometheus runs as a single replica in UDS Core. For clusters with many nodes or high cardinality workloads, increase resource allocation to prevent OOM kills and slow queries. See the Prometheus storage documentation for guidance on resource needs relative to ingestion volume.

    uds-bundle.yaml
    packages:
      - name: core
        repository: registry.defenseunicorns.com/public/core
        ref: x.x.x-upstream
        overrides:
          kube-prometheus-stack:
            kube-prometheus-stack:
              values:
                # Adjust resource values for your environment
                - path: prometheus.prometheusSpec.resources
                  value:
                    requests:
                      cpu: 200m
                      memory: 1Gi
                    limits:
                      cpu: 500m
                      memory: 4Gi
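    As a starting point for choosing the memory values above, a back-of-the-envelope estimate from active series count can be sketched in shell. The ~8 KiB-per-series figure is a commonly cited rule of thumb, not an official Prometheus formula, and the series count below is hypothetical; in a live cluster, check the prometheus_tsdb_head_series metric for the real number.

    ```shell
    # Rough baseline memory estimate for Prometheus:
    # assume ~8 KiB of RAM per active time series, plus headroom for queries and WAL replay.
    ACTIVE_SERIES=500000        # hypothetical; query prometheus_tsdb_head_series in your cluster
    BYTES_PER_SERIES=8192
    BASELINE_GIB=$(( ACTIVE_SERIES * BYTES_PER_SERIES / 1024 / 1024 / 1024 ))
    echo "baseline: ~${BASELINE_GIB} GiB (add headroom before setting the memory limit)"
    ```

    Treat the result as a floor, not a limit: spikes from heavy queries or compaction need extra headroom.
    
    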
  3. Create and deploy your bundle

    uds create <path-to-bundle-dir>
    uds deploy uds-bundle-<name>-<arch>-<version>.tar.zst

Confirm the monitoring stack is healthy:

# Check Grafana HPA status
uds zarf tools kubectl get hpa -n grafana
# Confirm multiple Grafana replicas are running
uds zarf tools kubectl get pods -n grafana -l app.kubernetes.io/name=grafana
# Check Prometheus resource allocation
uds zarf tools kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].spec.containers[0].resources}'

Success criteria:

  • Grafana HPA shows MINPODS: 2 and current replicas >= 2
  • All Grafana pods are Running and Ready
  • Grafana UI loads and dashboards display data
  • Prometheus pod resource limits match your configured values
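The replica criteria above can also be checked from a script. A minimal sketch: the helper function below is hypothetical, and the deployment name in the commented command assumes the default Grafana deployment is named grafana.

```shell
# check_ready MIN ACTUAL — prints OK when at least MIN replicas are ready
check_ready() {
  if [ "${2:-0}" -ge "$1" ]; then echo "OK"; else echo "NOT ready"; fi
}

# Against a live cluster you would pass the actual ready-replica count, e.g.:
#   check_ready 2 "$(uds zarf tools kubectl get deploy -n grafana grafana \
#     -o jsonpath='{.status.readyReplicas}')"
check_ready 2 3
```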

Problem: Grafana pods not starting after enabling HA


Symptoms: Pods in CrashLoopBackOff or Error state, logs show database connection errors.

Solution: Verify PostgreSQL connectivity and credentials:

uds zarf tools kubectl logs -n grafana -l app.kubernetes.io/name=grafana --tail=50

Ensure the PostgreSQL instance allows connections from the cluster’s CIDR range.
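One way to rule out credential and DNS problems is to assemble the exact connection parameters Grafana is configured with and probe them from inside the cluster. The host, database values, container image, and pod name below are all illustrative; match them to your uds-config.yaml.

```shell
# Connection parameters (illustrative; use the values from your uds-config.yaml)
PG_HOST=postgres.example.com
PG_PORT=5432
PG_USER=grafana
PG_DB=grafana
DSN="host=${PG_HOST} port=${PG_PORT} user=${PG_USER} dbname=${PG_DB} sslmode=require"
echo "$DSN"

# One-off probe pod (hypothetical name/image); psql prompts for the password:
#   uds zarf tools kubectl run pg-probe --rm -it --restart=Never --image=postgres:16 -- \
#     psql "$DSN" -c 'SELECT 1;'
```

If the probe succeeds but Grafana still fails, compare the credentials the pod actually received against the database, since a stale secret is a common cause.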

Problem: Dashboards show “No data” after migrating to HA


Symptoms: Grafana UI loads but dashboards display no data points.

Solution: Dashboard definitions are stored in ConfigMaps and will load automatically. If data sources are missing, check that the Grafana PostgreSQL database was initialized correctly — the Grafana migration should run automatically on first startup with the new database.

These guides and concepts may be useful to explore next: