Metrics Alerting
UDS Core deploys Prometheus with alerting enabled through the Prometheus Operator. You can define alerts on metrics collected from your applications and infrastructure using PrometheusRule custom resources, and these alerts are automatically routed to Alertmanager.
Default Alert Rules
By default, UDS Core ships with a set of alerting rules from the upstream kube-prometheus-stack chart. Documentation and runbooks for these rules are available from the upstream project.
These default rules are a great starting point for monitoring the health of your Kubernetes cluster and the components of UDS Core. However, you will likely want to create custom alerting rules specific to your applications and use cases.
Disabling Default Alert Rules
If you want to disable the default alerting rules provided by kube-prometheus-stack, you can do so by setting the following override in your UDS bundle:
packages:
  - name: uds-core
    repository: ghcr.io/defenseunicorns/packages/uds/core
    ref: x.x.x
    overrides:
      kube-prometheus-stack:
        kube-prometheus-stack:
          values:
            # Disable specific individual rules by name
            - path: defaultRules.disabled
              value:
                KubeControllerManagerDown: true
                KubeSchedulerDown: true
            # Disable entire rule groups with boolean toggles
            - path: defaultRules.rules.kubeControllerManager
              value: false
            - path: defaultRules.rules.kubeSchedulerAlerting
              value: false

This example shows both approaches: disabling individual rules (KubeControllerManagerDown and KubeSchedulerDown) and disabling entire rule groups (kubeControllerManager and kubeSchedulerAlerting). Use individual rule disabling for fine-tuned control and rule group disabling for broader changes.
Creating Custom Alert Rules
Create custom alerts using PrometheusRule CRs. These CRs are dynamically managed and loaded by the Prometheus Operator. All PrometheusRule alerts will be routed to Alertmanager. It is recommended to check these CRs into version control to declaratively manage your alerting rules.
Example PrometheusRule
This is an example PrometheusRule that creates two alerts: one for pods that are restarting frequently and another for high memory usage in containers. Both rules belong to the my-app rule group, and when they fire, the resulting alerts are sent to Alertmanager.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-app
spec:
  groups:
    - name: my-app
      rules:
        - alert: PodRestartingFrequently
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting frequently"
            runbook: "https://example.com/runbooks/pod-restart"
            description: "Pod restarted {{ $value }} times in the last hour"

        - alert: HighMemoryUsage
          expr: |
            (container_memory_working_set_bytes / container_spec_memory_limit_bytes) * 100 > 80
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage detected"
            runbook: "https://example.com/runbooks/high-memory-usage"
            description: "Container using {{ $value }}% of memory limit"

Best Practices
Some common best practices when creating PrometheusRule alerts:
- Deploy PrometheusRule CRs in the same namespace as your application
- Ship PrometheusRule CRs with your application if possible
- Use meaningful labels like severity to categorize alerts
- Add for clauses to prevent alert flapping
- Include annotations like summary, description, and runbook for context and to make alerts actionable
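As a rough illustration of these practices, the sketch below narrows the HighMemoryUsage idea to a single application namespace. It is one possible shape, not a rule shipped by UDS Core. The resource name my-app-memory-alerts, the group name my-app-memory, and the example.com runbook URL are hypothetical. The sketch also assumes that containers without a memory limit report container_spec_memory_limit_bytes as 0, which is typical for cAdvisor metrics but worth confirming in your cluster.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-memory-alerts   # hypothetical name
  namespace: my-app            # deployed alongside the application
spec:
  groups:
    - name: my-app-memory      # hypothetical group name
      rules:
        - alert: HighMemoryUsage
          # Assumption: containers without a memory limit report a limit of 0.
          # The "> 0" filter drops them so the ratio is never a division by zero.
          expr: |
            (
              container_memory_working_set_bytes{namespace="my-app", container!=""}
              /
              (container_spec_memory_limit_bytes{namespace="my-app", container!=""} > 0)
            ) * 100 > 80
          for: 15m             # must hold for 15 minutes before firing
          labels:
            severity: warning
          annotations:
            summary: "High memory usage in {{ $labels.namespace }}/{{ $labels.pod }}"
            description: "Container {{ $labels.container }} is using {{ $value }}% of its memory limit"
            runbook: "https://example.com/runbooks/high-memory-usage"
```

Scoping the selectors to the application's namespace keeps the alert from firing for unrelated workloads, and the pod and container labels in the annotations tell the responder which workload to inspect.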
More information on alerting best practices is available in the Prometheus documentation.