Monitoring

You’ll configure UDS Core’s monitoring stack for production high availability: enabling multi-replica Grafana with an external PostgreSQL database, tuning Prometheus resource allocation, and configuring Prometheus storage sizing and data retention.

Prerequisites:

  • UDS CLI installed
  • Access to a Kubernetes cluster (multi-node, multi-AZ recommended)
  • An external PostgreSQL instance accessible from the cluster (for Grafana HA)

Grafana’s default embedded SQLite database does not support multiple replicas and is lost on pod restart. Connecting an external PostgreSQL database enables multi-replica HA and persists dashboard configuration across restarts.
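Under the hood, these settings map to Grafana's `[database]` configuration section. As a rough sketch of the resulting Grafana configuration (the UDS package renders the actual values; the host, user, and password below are placeholders):

```ini
[database]
type = postgres
host = postgres.example.com:5432
name = grafana
user = grafana
password = your-password
```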

  1. Enable HA Grafana with external PostgreSQL

    Set the autoscaling toggle and non-secret database settings directly in the bundle, and use variables for credentials:

    uds-bundle.yaml

    ```yaml
    packages:
      - name: core
        repository: registry.defenseunicorns.com/public/core
        ref: x.x.x-upstream
        overrides:
          grafana:
            grafana:
              values:
                # Enable HorizontalPodAutoscaler
                - path: autoscaling.enabled
                  value: true
            uds-grafana-config:
              values:
                # PostgreSQL port
                - path: postgresql.port
                  value: 5432
                # Database name
                - path: postgresql.database
                  value: "grafana"
              variables:
                # PostgreSQL hostname
                - name: GRAFANA_PG_HOST
                  path: postgresql.host
                # Database user
                - name: GRAFANA_PG_USER
                  path: postgresql.user
                # Database password
                - name: GRAFANA_PG_PASSWORD
                  path: postgresql.password
                  sensitive: true
    ```

    uds-config.yaml

    ```yaml
    variables:
      core:
        GRAFANA_PG_HOST: "postgres.example.com"
        GRAFANA_PG_USER: "grafana"
        GRAFANA_PG_PASSWORD: "your-password"
    ```

    The default HPA configuration when HA is enabled:

    | Setting | Default | Override Path |
    | --- | --- | --- |
    | Minimum replicas | 2 | `autoscaling.minReplicas` |
    | Maximum replicas | 5 | `autoscaling.maxReplicas` |
    | CPU target utilization | 70% | `autoscaling.metrics[0].resource.target.averageUtilization` |
    | Memory target utilization | 75% | `autoscaling.metrics[1].resource.target.averageUtilization` |
    | Scale-down stabilization | 300 seconds | `autoscaling.behavior.scaleDown.stabilizationWindowSeconds` |
    | Scale-down rate | 1 pod per 300 seconds | `autoscaling.behavior.scaleDown.policies[0]` |
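    If the defaults don't fit your environment, the override paths in the table above can be set in the same bundle. For example, a sketch that widens the replica range (the values shown are illustrative, not recommendations):

    ```yaml
    overrides:
      grafana:
        grafana:
          values:
            - path: autoscaling.minReplicas
              value: 3
            - path: autoscaling.maxReplicas
              value: 8
    ```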
  2. Tune Prometheus resources

    Prometheus runs as a single replica in UDS Core. For clusters with many nodes or high cardinality workloads, increase resource allocation to prevent OOM kills and slow queries. See the Prometheus storage documentation for guidance on resource needs relative to ingestion volume.

    uds-bundle.yaml

    ```yaml
    packages:
      - name: core
        repository: registry.defenseunicorns.com/public/core
        ref: x.x.x-upstream
        overrides:
          kube-prometheus-stack:
            kube-prometheus-stack:
              values:
                # Adjust resource values for your environment
                - path: prometheus.prometheusSpec.resources
                  value:
                    requests:
                      cpu: 200m
                      memory: 1Gi
                    limits:
                      cpu: 500m
                      memory: 4Gi
    ```
  3. Configure Prometheus storage and retention

    UDS Core provisions a 50Gi PVC with 10-day retention by default. Adjust both settings based on the number of scrape targets, metrics cardinality, and how long you need to keep historical data.
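    A quick way to size these settings is to estimate the ingestion rate first. As a rule of thumb, ingested samples per second is roughly the number of active series divided by the scrape interval. A minimal sketch, where both input numbers are illustrative assumptions rather than measurements:

    ```shell
    active_series=500000   # e.g. observed via the prometheus_tsdb_head_series metric
    scrape_interval=30     # seconds; check your actual configured scrape interval
    samples_per_second=$((active_series / scrape_interval))
    echo "${samples_per_second} samples/s"
    ```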

    | Setting | Default | Override Path |
    | --- | --- | --- |
    | PVC size | 50Gi | `prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage` |
    | Time-based retention | 10d | `prometheus.prometheusSpec.retention` |
    | Size-based retention | Disabled | `prometheus.prometheusSpec.retentionSize` |
    uds-bundle.yaml

    ```yaml
    packages:
      - name: core
        repository: registry.defenseunicorns.com/public/core
        ref: x.x.x-upstream
        overrides:
          kube-prometheus-stack:
            kube-prometheus-stack:
              values:
                # Increase PVC size for longer retention
                - path: prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage
                  value: "100Gi"
                # Keep data for 30 days
                - path: prometheus.prometheusSpec.retention
                  value: "30d"
                # Safety cap: drop oldest data if disk usage exceeds this limit
                - path: prometheus.prometheusSpec.retentionSize
                  value: "90GB"
    ```

    To estimate disk needs, use the upstream formula from the Prometheus storage documentation:

    needed_disk_space = retention_time_seconds × ingested_samples_per_second × bytes_per_sample

    In practice, bytes_per_sample averages 1–2 bytes after compression. Start with the defaults, then query prometheus_tsdb_storage_blocks_bytes in Grafana to observe actual usage and project growth before resizing.
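    The formula above can be sanity-checked with shell arithmetic before committing to a PVC size; all of the inputs here are illustrative assumptions:

    ```shell
    retention_seconds=$((30 * 24 * 3600))  # 30-day retention
    samples_per_second=100000              # assumed ingestion rate
    bytes_per_sample=2                     # conservative post-compression average
    needed_bytes=$((retention_seconds * samples_per_second * bytes_per_sample))
    echo "$((needed_bytes / 1024 ** 3)) GiB needed"
    ```

    With these assumptions the estimate lands well above the 100Gi example, which is why observing actual usage before resizing matters.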

  4. Create and deploy your bundle

    ```shell
    uds create <path-to-bundle-dir>
    uds deploy uds-bundle-<name>-<arch>-<version>.tar.zst
    ```

Confirm the monitoring stack is healthy:

```shell
# Check Grafana HPA status
uds zarf tools kubectl get hpa -n grafana

# Confirm multiple Grafana replicas are running
uds zarf tools kubectl get pods -n grafana -l app.kubernetes.io/name=grafana

# Check Prometheus resource allocation
uds zarf tools kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].spec.containers[0].resources}'

# Check Prometheus PVC size and capacity
uds zarf tools kubectl get pvc -n monitoring -l "operator.prometheus.io/name=kube-prometheus-stack-prometheus" -o custom-columns=NAME:.metadata.name,REQ:.spec.resources.requests.storage,CAP:.status.capacity.storage
```

Success criteria:

  • Grafana HPA shows MINPODS: 2 and current replicas >= 2
  • All Grafana pods are Running and Ready
  • Grafana UI loads and dashboards display data
  • Prometheus pod resource limits match your configured values
  • Prometheus PVC request matches your configured storage size

Problem: Grafana pods not starting after enabling HA


Symptoms: Pods in CrashLoopBackOff or Error state, logs show database connection errors.

Solution: Verify PostgreSQL connectivity and credentials:

```shell
uds zarf tools kubectl logs -n grafana -l app.kubernetes.io/name=grafana --tail=50
```

Ensure the PostgreSQL instance allows connections from the cluster’s CIDR range.

Problem: Dashboards show “No data” after migrating to HA


Symptoms: Grafana UI loads but dashboards display no data points.

Solution: Dashboard definitions are stored in ConfigMaps and will load automatically. If data sources are missing, check that the Grafana PostgreSQL database was initialized correctly — the Grafana migration should run automatically on first startup with the new database.

Problem: Prometheus pod crash-looping with storage errors


Symptoms: Pod in CrashLoopBackOff, logs show no space left on device or TSDB compaction errors.

Solution: Check Prometheus logs and PVC capacity:

```shell
# Tail recent Prometheus logs
uds zarf tools kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus --tail=50

# Compare requested vs. actual PVC capacity
uds zarf tools kubectl get pvc -n monitoring -l "operator.prometheus.io/name=kube-prometheus-stack-prometheus" -o custom-columns=NAME:.metadata.name,REQ:.spec.resources.requests.storage,CAP:.status.capacity.storage
```

Either lower the retentionSize limit to trigger faster data pruning, or expand the PVC using the Resize Prometheus PVCs runbook.
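As an interim measure while waiting on a resize, the size cap can be lowered through the same override path used earlier. A sketch (the value is illustrative and should stay below the PVC size):

```yaml
overrides:
  kube-prometheus-stack:
    kube-prometheus-stack:
      values:
        - path: prometheus.prometheusSpec.retentionSize
          value: "40GB"
```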

These guides and concepts may be useful to explore next: