
Monitoring

You’ll configure UDS Core’s monitoring stack for production high availability: enabling multi-replica Grafana with an external PostgreSQL database and tuning Prometheus resource allocation.

  • UDS CLI installed
  • Access to a Kubernetes cluster (multi-node, multi-AZ recommended)
  • An external PostgreSQL instance accessible from the cluster (for Grafana HA)

Grafana’s default embedded SQLite database does not support multiple replicas and is lost on pod restart. Connecting an external PostgreSQL database enables multi-replica HA and persists dashboard configuration across restarts.

  1. Enable HA Grafana with external PostgreSQL

    Set the autoscaling toggle and non-secret database settings directly in the bundle, and use variables for credentials:

    uds-bundle.yaml
    packages:
      - name: core
        repository: registry.defenseunicorns.com/public/core
        ref: x.x.x-upstream
        overrides:
          grafana:
            grafana:
              values:
                # Enable HorizontalPodAutoscaler
                - path: autoscaling.enabled
                  value: true
            uds-grafana-config:
              values:
                # PostgreSQL port
                - path: postgresql.port
                  value: 5432
                # Database name
                - path: postgresql.database
                  value: "grafana"
              variables:
                # PostgreSQL hostname
                - name: GRAFANA_PG_HOST
                  path: postgresql.host
                # Database user
                - name: GRAFANA_PG_USER
                  path: postgresql.user
                # Database password
                - name: GRAFANA_PG_PASSWORD
                  path: postgresql.password
                  sensitive: true

    uds-config.yaml
    variables:
      core:
        GRAFANA_PG_HOST: "postgres.example.com"
        GRAFANA_PG_USER: "grafana"
        GRAFANA_PG_PASSWORD: "your-password"

    The default HPA configuration when HA is enabled:

    | Setting | Default | Override Path |
    | --- | --- | --- |
    | Minimum replicas | 2 | autoscaling.minReplicas |
    | Maximum replicas | 5 | autoscaling.maxReplicas |
    | CPU target utilization | 70% | autoscaling.metrics[0].resource.target.averageUtilization |
    | Memory target utilization | 75% | autoscaling.metrics[1].resource.target.averageUtilization |
    | Scale-down stabilization | 300 seconds | autoscaling.behavior.scaleDown.stabilizationWindowSeconds |
    | Scale-down rate | 1 pod per 300 seconds | autoscaling.behavior.scaleDown.policies[0] |
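    Any of these defaults can be changed through the same bundle override mechanism shown earlier. For example, a sketch that widens the replica range (the values here are illustrative, not recommendations):

    ```yaml
    # Fragment of the grafana override section in uds-bundle.yaml
    overrides:
      grafana:
        grafana:
          values:
            - path: autoscaling.minReplicas
              value: 3   # illustrative: raise the floor for larger clusters
            - path: autoscaling.maxReplicas
              value: 8   # illustrative: allow more headroom under load
    ```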
  2. Tune Prometheus resources

    Prometheus runs as a single replica in UDS Core. For clusters with many nodes or high cardinality workloads, increase resource allocation to prevent OOM kills and slow queries. See the Prometheus storage documentation for guidance on resource needs relative to ingestion volume.

    uds-bundle.yaml
    packages:
      - name: core
        repository: registry.defenseunicorns.com/public/core
        ref: x.x.x-upstream
        overrides:
          kube-prometheus-stack:
            kube-prometheus-stack:
              values:
                # Adjust resource values for your environment
                - path: prometheus.prometheusSpec.resources
                  value:
                    requests:
                      cpu: 200m
                      memory: 1Gi
                    limits:
                      cpu: 500m
                      memory: 4Gi
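    As a starting point for choosing the memory values above, a back-of-the-envelope estimate from active series count can be sketched in shell. The ~8 KiB-per-series figure is a commonly cited rule of thumb, not an official Prometheus formula, and the series count below is hypothetical; in a live cluster, check the prometheus_tsdb_head_series metric for the real number.

    ```shell
    # Rough baseline memory estimate for Prometheus:
    # assume ~8 KiB of RAM per active time series, plus headroom for queries and WAL replay.
    ACTIVE_SERIES=500000        # hypothetical; query prometheus_tsdb_head_series in your cluster
    BYTES_PER_SERIES=8192
    BASELINE_GIB=$(( ACTIVE_SERIES * BYTES_PER_SERIES / 1024 / 1024 / 1024 ))
    echo "baseline: ~${BASELINE_GIB} GiB (add headroom before setting the memory limit)"
    ```

    Treat the result as a floor, not a limit: spikes from heavy queries or compaction need extra headroom.
    
    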
  3. Create and deploy your bundle

    uds create <path-to-bundle-dir>
    uds deploy uds-bundle-<name>-<arch>-<version>.tar.zst

Confirm the monitoring stack is healthy:

# Check Grafana HPA status
uds zarf tools kubectl get hpa -n grafana
# Confirm multiple Grafana replicas are running
uds zarf tools kubectl get pods -n grafana -l app.kubernetes.io/name=grafana
# Check Prometheus resource allocation
uds zarf tools kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].spec.containers[0].resources}'

Success criteria:

  • Grafana HPA shows MINPODS: 2 and current replicas >= 2
  • All Grafana pods are Running and Ready
  • Grafana UI loads and dashboards display data
  • Prometheus pod resource limits match your configured values
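The replica criteria above can also be checked from a script. A minimal sketch: the helper function below is hypothetical, and the deployment name in the commented command assumes the default Grafana deployment is named grafana.

```shell
# check_ready MIN ACTUAL — prints OK when at least MIN replicas are ready
check_ready() {
  if [ "${2:-0}" -ge "$1" ]; then echo "OK"; else echo "NOT ready"; fi
}

# Against a live cluster you would pass the actual ready-replica count, e.g.:
#   check_ready 2 "$(uds zarf tools kubectl get deploy -n grafana grafana \
#     -o jsonpath='{.status.readyReplicas}')"
check_ready 2 3
```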

Problem: Grafana pods not starting after enabling HA


Symptoms: Pods in CrashLoopBackOff or Error state, logs show database connection errors.

Solution: Verify PostgreSQL connectivity and credentials:

uds zarf tools kubectl logs -n grafana -l app.kubernetes.io/name=grafana --tail=50

Ensure the PostgreSQL instance allows connections from the cluster’s CIDR range.
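One way to rule out credential and DNS problems is to assemble the exact connection parameters Grafana is configured with and probe them from inside the cluster. The host, database values, container image, and pod name below are all illustrative; match them to your uds-config.yaml.

```shell
# Connection parameters (illustrative; use the values from your uds-config.yaml)
PG_HOST=postgres.example.com
PG_PORT=5432
PG_USER=grafana
PG_DB=grafana
DSN="host=${PG_HOST} port=${PG_PORT} user=${PG_USER} dbname=${PG_DB} sslmode=require"
echo "$DSN"

# One-off probe pod (hypothetical name/image); psql prompts for the password:
#   uds zarf tools kubectl run pg-probe --rm -it --restart=Never --image=postgres:16 -- \
#     psql "$DSN" -c 'SELECT 1;'
```

If the probe succeeds but Grafana still fails, compare the credentials the pod actually received against the database, since a stale secret is a common cause.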

Problem: Dashboards show “No data” after migrating to HA


Symptoms: Grafana UI loads but dashboards display no data points.

Solution: Dashboard definitions are stored in ConfigMaps and will load automatically. If data sources are missing, check that the Grafana PostgreSQL database was initialized correctly — the Grafana migration should run automatically on first startup with the new database.

These guides and concepts may be useful to explore next: