Grafana, in conjunction with Prometheus and Alertmanager, is a commonly used solution for monitoring Kubernetes clusters. The stack is universally applicable and can be used both in cloud and bare-metal clusters. It is functional, easily integrable and free, which accounts for its popularity.

 

In this article I will show how to integrate Grafana with Alertmanager, manage silences from Grafana, configure Alertmanager to inhibit alerts, and keep this configuration in code so it can be reused. Following the steps described below you will learn how to:

 

  • add an Alertmanager data source to Grafana in code
  • configure Alertmanager to visualize alerts properly
  • suppress some alerts via Alertmanager configuration

 

Requirements

You will need a Kubernetes cluster with the `kube-prometheus-stack` Helm chart (version 39.5.0) installed. You can use your existing cluster or deploy a testing environment. For an example, see our article Deploying Prometheus Monitoring Stack with Cluster.dev.

 

Introduction

Starting from v8.0, Grafana ships with an integrated alerting system for acting on metrics and logs from a variety of external sources. At the same time, Grafana is compatible with Alertmanager and Prometheus by default – a combination that much of the industry relies on when monitoring Kubernetes clusters.

 

One of the reasons we prefer Alertmanager over native Grafana alerting is that it is easier to automate when the configuration lives in code. While you can also define Grafana-managed visualization panels in code and reuse them later, they are much harder to manage. Alertmanager also ships together with Prometheus in the `kube-prometheus-stack` Helm chart – the resource we use to monitor Kubernetes clusters.

 

Grafana integration with Alertmanager

The first thing we do is configure Grafana integration with Alertmanager.
In order to make it automatic, add the following code to `kube-prometheus-stack` values:


grafana:
  additionalDataSources:
   - name: Alertmanager
     type: alertmanager
     url: http://monitoring-kube-prometheus-alertmanager:9093
     editable: true
     access: proxy
     version: 2
     jsonData:
       implementation: prometheus

Customize the value of the `url:` key if it differs in your setup. Deploy the code to your cluster and check that the new data source appears in Grafana.

 

 

Then check active alerts – you should see at least one default alert.
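The same check can be scripted against the Alertmanager v2 API. A minimal sketch in Python, assuming the Alertmanager service has been port-forwarded to `localhost:9093` (the helper names here are illustrative, not part of any library):

```python
import json
import urllib.request

def alert_names(alerts):
    """Return the sorted, de-duplicated alertname labels from a /api/v2/alerts payload."""
    return sorted({a["labels"].get("alertname", "?") for a in alerts})

def fetch_alerts(base_url="http://localhost:9093"):
    """Fetch currently active alerts from the Alertmanager v2 API."""
    with urllib.request.urlopen(f"{base_url}/api/v2/alerts") as resp:
        return json.load(resp)

# Usage, after port-forwarding the service:
#   kubectl port-forward svc/monitoring-kube-prometheus-alertmanager 9093:9093
#   print(alert_names(fetch_alerts()))
```

A default alert such as `Watchdog` should appear in the output.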

 

 

Add Alertmanager configuration

With this integration, duplicated alerts sometimes cannot be avoided, but in most cases they can. To see alerts without duplication you need to configure Alertmanager properly, which means having exactly one receiver per alert.

 

In our case, to keep things simple we will add two receivers:

 

  • `blackhole` – for alerts with zero priority that do not need to be sent
  • `default` – for alerts with severity level `info`, `warning`, or `critical`

 

The `default` receiver should have all needed notification channels. In our case we have two example channels – `telegram` and `slack`.

 

To automate the setup of the Alertmanager configuration, add the following code to the `kube-prometheus-stack` values file:


alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: [...]
      group_wait: 9s
      group_interval: 9s
      repeat_interval: 120h
      receiver: blackhole
      routes:
        - receiver: default
          group_by: [...]
          match_re:
            severity: "info|warning|critical"
          continue: false
          repeat_interval: 120h
    receivers:
      - name: blackhole
      - name: default
        telegram_configs:
          - chat_id: -000000000
            bot_token: 0000000000:00000000000000000000000000000000000
            message: |
              'Status: <a href="https://127.0.0.1">{{ .Status }}</a>'
              '{{ .CommonAnnotations.message }}'
            api_url: https://127.0.0.1
            parse_mode: HTML
            send_resolved: true
        slack_configs:
          - api_url: https://127.0.0.1/services/00000000000/00000000000/000000000000000000000000
            username: alertmanager
            title: "Status: {{ .Status }}"
            text: "{{ .CommonAnnotations.message }}"
            title_link: "https://127.0.0.1"
            send_resolved: true

Deploy the code to your cluster and check for active alerts – they should not be duplicated.
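The routing logic above can be sketched as a small function – a toy illustration of how the route tree behaves, not Alertmanager's actual implementation. Note that Alertmanager fully anchors `match_re` values, which the regex below mirrors:

```python
import re

# The child route catches these severities; Alertmanager anchors match_re values.
SEVERITY_RE = re.compile(r"^(?:info|warning|critical)$")

def pick_receiver(labels):
    """Return the receiver the route tree above would pick for an alert's labels.

    The child route sends matching severities to "default"; everything
    else falls through to the root receiver, "blackhole".
    """
    if SEVERITY_RE.match(labels.get("severity", "")):
        return "default"
    return "blackhole"
```

Because every alert ends up at exactly one receiver, no alert is delivered twice.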

 

Add example inhibition rules

In some cases we want to disable alerts via silences, and sometimes it is better to do it in code. A silence is good as a temporary measure; it is, however, impermanent and has to be recreated if you deploy to an empty cluster. Disabling alerts in code, on the other hand, is a sustainable solution that survives repeated deployments.

 

Disabling alerts via a silence is simple – just open the Silences tab and create one with the desired duration, for example `99999d`. If persistent storage is enabled for Alertmanager, such a silence is effectively permanent.
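Silences can also be created programmatically through the Alertmanager v2 API, which is handy for scripting. A sketch assuming a port-forwarded Alertmanager on `localhost:9093`; the helper names and default arguments are illustrative:

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

def build_silence(alertname, hours=24, author="ops", comment="created in code"):
    """Build a request body for POST /api/v2/silences."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": "alertname", "value": alertname,
             "isRegex": False, "isEqual": True},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }

def post_silence(payload, base_url="http://localhost:9093"):
    """Create the silence on a port-forwarded Alertmanager."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/silences",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # the response contains the new silence ID
```

Keep in mind that a silence created this way is still state in Alertmanager, not configuration, so it is lost on a fresh cluster unless recreated.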

 

 

This section mostly concerns the second case, because adding silences as code is not an easy task. Instead, we will inhibit two test alerts using the `Watchdog` alert, which is always firing by default, as the inhibition source.

 

Add this code to the `kube-prometheus-stack` values file:


    inhibit_rules:
      - target_matchers:
          - alertname =~ "ExampleAlertToInhibitOne|ExampleAlertToInhibitTwo"
        source_matchers:
          - alertname = Watchdog

The resulting code should look like this:


alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: [...]
      group_wait: 9s
      group_interval: 9s
      repeat_interval: 120h
      receiver: blackhole
      routes:
        - receiver: default
          group_by: [...]
          match_re:
            severity: "info|warning|critical"
          continue: false
          repeat_interval: 120h
    inhibit_rules:
      - target_matchers:
          - alertname =~ "ExampleAlertToInhibitOne|ExampleAlertToInhibitTwo"
        source_matchers:
          - alertname = Watchdog
    receivers:
      - name: blackhole
      - name: default
        telegram_configs:
          - chat_id: -000000000
            bot_token: 0000000000:00000000000000000000000000000000000
            message: |
              'Status: <a href="https://127.0.0.1">{{ .Status }}</a>'
              '{{ .CommonAnnotations.message }}'
            api_url: https://127.0.0.1
            parse_mode: HTML
            send_resolved: true
        slack_configs:
          - api_url: https://127.0.0.1/services/00000000000/00000000000/000000000000000000000000
            username: alertmanager
            title: "Status: {{ .Status }}"
            text: "{{ .CommonAnnotations.message }}"
            title_link: "https://127.0.0.1"
            send_resolved: true

Deploy the code to your cluster. Add test alerts with the following code:


apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-rules
  namespace: monitoring
spec:
  groups:
    - name: "test alerts"
      rules:
        - alert: ExampleAlertToInhibitOne
          expr: vector(1)
        - alert: ExampleAlertToInhibitTwo
          expr: vector(1)

Deploy the code with the test alerts to your cluster and check that the test rules appear in the rules list. Wait 1-3 minutes for the test alerts to show up; they should be suppressed.
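The suppression can be verified against the Alertmanager v2 API, where inhibited alerts report `status.state: suppressed`. A small illustrative helper (the function name is ours, the payload shape is the v2 API's):

```python
def suppressed_names(alerts):
    """Alertnames whose status.state is 'suppressed' in a /api/v2/alerts payload."""
    return sorted(
        a["labels"]["alertname"]
        for a in alerts
        if a.get("status", {}).get("state") == "suppressed"
    )

# Usage with the payload from GET /api/v2/alerts on a port-forwarded Alertmanager:
#   print(suppressed_names(alerts))
```

With the inhibit rule in place, both example alerts should be listed while `Watchdog` stays active.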

 

 

Conclusion

In this article we have reviewed a generic case of integrating Grafana with Alertmanager, learnt how to manage silences in Grafana, and inhibited alerts via Alertmanager configuration kept in code. You can now manage your alerts in an easy, reproducible way with minimal code. The basic examples are ready to use in your projects and applicable to almost any configuration.