A Dead Man’s Switch is an alert that lets us know when our Prometheus cluster is no longer functioning correctly. This is important, because it would be a disaster if our monitoring pipeline went down and critical alerts weren’t being triggered!
In Prometheus, we need to define an alerting rule that fires continuously, so that if it ever stops firing, we know something is wrong.
    groups:
      - name: meta
        rules:
          - alert: DeadMansSwitch
            expr: vector(1)
            labels:
              severity: critical
            annotations:
              description: This is a DeadMansSwitch meant to ensure that the entire Alerting pipeline is functional.
              summary: Alerting DeadMansSwitch
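Before wiring up notifications, it can be worth checking that the rule really does fire on every evaluation. The following is a minimal sketch of a rule unit test for promtool, assuming the rule above is saved as deadmansswitch.rules.yml and the test as deadmansswitch.test.yml (both file names are hypothetical, not from the original setup):

```yaml
# deadmansswitch.test.yml (hypothetical file name)
rule_files:
  - deadmansswitch.rules.yml   # assumed location of the rule above

evaluation_interval: 1m

tests:
  - interval: 1m
    # vector(1) does not depend on any scraped series, so no input is needed.
    input_series: []
    alert_rule_test:
      - eval_time: 1m
        alertname: DeadMansSwitch
        exp_alerts:
          # The alert should be firing with exactly the labels and
          # annotations defined in the rule (alertname is implied).
          - exp_labels:
              severity: critical
            exp_annotations:
              description: This is a DeadMansSwitch meant to ensure that the entire Alerting pipeline is functional.
              summary: Alerting DeadMansSwitch
```

Running promtool test rules deadmansswitch.test.yml should report success if the rule is loaded and the alert is firing at the one-minute mark.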
Once we have the alert working in Prometheus, we can configure Alertmanager. For this config I will be using email notifications to a service called DeadMansSnitch, which handles DeadMansSwitch alerting and has a nice integration with PagerDuty.
In DeadMansSnitch, we need to set the check-in interval to 15 minutes. Since Alertmanager re-sends the DeadMansSwitch notification every 5 minutes (the repeat_interval in the config), DeadMansSnitch will alert us only after roughly three consecutive notifications have gone missing. This allows us to sidestep the unreliability of email, which we’ll use from Alertmanager to DeadMansSnitch for want of a better option.
In Alertmanager, this will be our config. You will need to replace the to address under email_configs with the DeadMansSnitch email for your account.
    global:
      smtp_smarthost: "smtp.sendgrid.net:587"
      smtp_from: "Alertmanager <email@example.com>"
      smtp_auth_username: apikey
      smtp_auth_password: "<api key here>"
    route:
      routes:
        - match:
            alertname: DeadMansSwitch
          repeat_interval: 5m
          receiver: deadmansswitch
    receivers:
      - name: deadmansswitch
        email_configs:
          - to: "firstname.lastname@example.org"
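Note that the config above shows only the DeadMansSwitch route. Alertmanager also requires the root route to name a default receiver before it will load a configuration, so in a full config the routing section might look something like the sketch below. The receiver name default is a placeholder assumption; in practice it would be whatever receiver handles your regular alerts.

```yaml
route:
  # Required on the root route: where alerts go when no child route matches.
  # "default" is a hypothetical placeholder, not part of the original config.
  receiver: default
  routes:
    # Send only the always-firing DeadMansSwitch alert to the snitch,
    # re-notifying every 5 minutes so the 15-minute check-in is met.
    - match:
        alertname: DeadMansSwitch
      repeat_interval: 5m
      receiver: deadmansswitch

receivers:
  # Placeholder receiver with no notification integrations configured.
  - name: default
  - name: deadmansswitch
    email_configs:
      - to: "firstname.lastname@example.org"
```

Keeping the DeadMansSwitch match in its own child route means the short repeat_interval applies only to that alert and does not make every other alert re-notify every 5 minutes.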
You now have Prometheus/Alertmanager triggering DeadMansSnitch!
You can now set up a PagerDuty integration to page you when the DeadMansSwitch fails.