The Problem
Testing that AWS Cloudwatch alarms are configured correctly, can be triggered and result in an on-call alert. This is rather complex issue due to a number of factors:
- Simulating data that will trigger the alarm. Different metrics require different data (CPU, Memory, Network)
- The alarm needed to be active for a period of time (5-20 minutes) as the on-call alert is a human process
Method 1: set-alarm-state
Initially set-alarm-state
seemed like an ideal candidate. Manual control over the alarm's state.
Problems
When testing it would change the alarm's state to ALARM
but it would immedietly toggle back to OK
. It just wasn't long enough to trigger an on-call alert.
Method 2: put-metric-alarm
Use put-metric-alarm
to update an alarm's configuration so healthy stats will trigger the alarm.
Problems
This worked but it was complex. You must send all of the alarm's configuration as it overwrites all of the previous values. After testing the on-call alert then the original configuration must be sent back over put-metric-alarm
. There is risk that the alarm will become misconfigured.
Method 3: put-metric-data
Use put-metric-data
to send custom metric data designed to trigger the alarm.
The first implementation used javascript's setInterval
to send the custom metric data once every second. Then the script just needed to be ran for the time required to trigger an on-call alert.
A future iteration could include constructing a timespan of custom metric data and send it on a single request. Or pulling down the alarm's cloudformation template and auto generate the custom metric data.
Problems
Some companies might feel hesitant to grant permissions to do this. Especially in higher environments.