Sending AWS CloudWatch alarms through SNS to MSTeams

I'm new to AWS, so please take the following statements with a grain of salt. Also, I'm tired, but I want to get this off my chest before the weekend begins (although, technically, it has already begun), so it might not be very coherent.

AWS provides some minimal monitoring of your resources with a tool called CloudWatch. Think of Prometheus + Grafana, but more limited. Still, it's good enough that it makes sense to set up some Alerts on it. Many of AWS's resources are not processes running on a computer you have access to, so you can't always install some exporters and do the monitoring yourself.

If you're like me, CloudWatch Alerts must be sent to the outside world so you can receive them and react. One way to do this[1] is to channel them through SNS. SNS supports many protocols, most of them internal to AWS, but also HTTP/S. SNS is a pub-sub system, and requires a little bit of protocol before it works.

On the other end we[2] have MSTeams[3]. MSTeams has many ways of communicating. One is Chat, which is a crappy chat[6][7], and another is some kind of mix between a blog and twitter, confusingly called Teams. The idea in a Team is that you can post... Posts? Articles? And from them you can have an unthreaded conversation. Only Teams have webhooks; Chats do not, so you can't point SNS there.

If you have read other articles about integrating CloudWatch Alerts or SNS with MSTeams, they will always tell you that you need not only SNS, but also a Lambda program. Since we already handle a gazillion servers, not all of them in AWS, one of which is dedicated HW we pay quite little for, and since we're also trying to slim our AWS bill (who isn't?), I decided to see if I could build my own bridge between SNS and Teams.

I already said that SNS has a little protocol. The idea is that when you create an HTTP/S Subscription in SNS, it will POST a first message to the URL you define. This message will have a JSON payload. We're interested in two fields:

{
    "Type": "SubscriptionConfirmation",
    "SubscribeURL": "..."
}

What you have to do is get this URL and call it. That way SNS knows the endpoint exists and associates an ARN with the Subscription. Otherwise, the Subscription will stay unconfirmed and no messages will be sent to it. Interestingly, you can neither edit nor remove Subscriptions (at least not with the web interface), and I read that unconfirmed Subscriptions will disappear after 3 days or so[4].
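
As a minimal sketch of that handshake, assuming the body arrives as raw JSON bytes: the helper below (`subscribe_url` is a made-up name) only extracts the URL to call, so it can be tried without touching the network. The confirmation itself is then a plain GET on that URL.

```python
import json
from typing import Optional


def subscribe_url(body: bytes) -> Optional[str]:
    """Return the URL to call to confirm a Subscription, or None.

    `body` is the raw POST body as sent by SNS. Only the two fields
    mentioned above are looked at.
    """
    payload = json.loads(body)
    if payload.get("Type") != "SubscriptionConfirmation":
        return None
    return payload.get("SubscribeURL")


# hypothetical payload; a real SubscribeURL is generated by SNS
sample = json.dumps({
    "Type": "SubscriptionConfirmation",
    "SubscribeURL": "https://sns.example.invalid/confirm",
}).encode()

url = subscribe_url(sample)
# confirming would then be: urllib.request.urlopen(url)
```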

SNS messages are also a JSON payload POST'ed to the URL. They look like this:

{
  "Type" : "Notification",
  "MessageId" : "<uuid1>",
  "TopicArn" : "<arn>",
  "Subject" : "...",
  "Message" : "...",
  "Timestamp" : "2024-01-19T14:29:54.147Z",
  "SignatureVersion" : "1",
  "Signature" : "cTQUWntlQW5evk/bZ5lkhSdWj2+4oa/4eApdgkcdebegX3Dvwpq786Zi6lZbxGsjof2C+XMt4rV9xM1DBlsVq6tsBQvkfzGBzOvwerZZ7j4Sfy/GTJvtS4L2x/OVUCLleY3ULSCRYX2H1TTTanK44tOU5f8W+8AUz1DKRT+qL+T2fWqmUrPYSK452j/rPZcZaVwZnNaYkroPmJmI4gxjr/37Q6gA8sK+WyC0U91/MDKHpuAmCAXrhgrJIpEX/1t2mNlnlbJpcsR9h05tHJNkQEkPwFY0HFTnyGvTM2DP6Ep7C2z83/OHeVJ6pa7Sn3txVWR5AQC1PF8UbT7zdGJL9Q==",
  "SigningCertURL" : "https://sns.eu-west-1.amazonaws.com/SimpleNotificationService-01d088a6f77103d0fe307c0069e40ed6.pem",
  "UnsubscribeURL" : "https://sns.eu-west-1.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=<arn>:<uuid2>"
}

Now, CloudWatch Alerts sent via SNS are sent in the Message field. As Message's value is a string and the Alert is encoded as JSON, yes, you guessed it, it's double encoded:

{
  "Message" : "{\"AlarmName\":\"foo\",...}"
}

Sigh. After unwrapping it, it looks like this:

{
  "AlarmName": "...",
  "AlarmDescription": "...",
  "AWSAccountId": "...",
  "AlarmConfigurationUpdatedTimestamp": "2024-01-18T14:32:17.244+0000",
  "NewStateValue": "ALARM",
  "NewStateReason": "Threshold Crossed: 1 out of the last 1 datapoints [10.337853107344637 (18/01/24 14:28:00)] was greater than the threshold (10.0) (minimum 1 datapoint for OK -> ALARM transition).",
  "StateChangeTime": "2024-01-18T14:34:54.103+0000",
  "Region": "EU (Ireland)",
  "AlarmArn": "<alarm_arn>",
  "OldStateValue": "INSUFFICIENT_DATA",
  "OKActions": [],
  "AlarmActions": [
    "<sns_arn>"
  ],
  "InsufficientDataActions": [],
  "Trigger": {
    "MetricName": "CPUUtilization",
    "Namespace": "AWS/EC2",
    "StatisticType": "Statistic",
    "Statistic": "AVERAGE",
    "Unit": null,
    "Dimensions": [
      {
        "value": "<aws_id>",
        "name": "InstanceId"
      }
    ],
    "Period": 60,
    "EvaluationPeriods": 1,
    "DatapointsToAlarm": 1,
    "ComparisonOperator": "GreaterThanThreshold",
    "Threshold": 10.0,
    "TreatMissingData": "missing",
    "EvaluateLowSampleCountPercentile": ""
  }
}
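
Getting from the SNS envelope to this structure takes two rounds of decoding. A small sketch, with a shortened made-up envelope and a helper name (`unwrap_alarm`) that is not part of any library; a Message that is not a JSON object (for instance a text published by hand to the topic) comes back as None:

```python
import json
from typing import Optional


def unwrap_alarm(body: str) -> Optional[dict]:
    """Decode the SNS envelope, then the alarm inside its Message field."""
    envelope = json.loads(body)                    # first decoding: the envelope
    try:
        alarm = json.loads(envelope["Message"])    # second decoding: the alarm
    except json.JSONDecodeError:
        return None                                # Message was plain text
    # a bare number or string is valid JSON too, but it is not an alarm
    return alarm if isinstance(alarm, dict) else None


# shortened, made-up data; a real envelope has the fields shown earlier
alarm_json = json.dumps({"AlarmName": "foo", "NewStateValue": "ALARM"})
envelope_json = json.dumps({"Type": "Notification", "Message": alarm_json})

alarm = unwrap_alarm(envelope_json)
print(alarm["NewStateValue"])
```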

The name and description are arbitrary texts you wrote when setting up the Alarm and the Subscription. Notice that the region is not the codename, as in eu-west-1, but a supposedly more human-readable text. The rest is mostly info about the Alarm itself. Also notice the Dimensions field. I don't know what other data comes here (probably the arbitrary fields and values you can set up in the Alarm); all I can say is that this format (a list of dicts with only two fields, one called name and the other value) is possibly the most annoying implementation of a simple dict. I hope they have a reason for that, besides over-engineering.
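
If that list-of-pairs format gets in the way, it folds into a regular dict in one line. A sketch, assuming every entry has exactly the name and value keys seen above (what other Alarms send has not been checked); `dimensions_to_dict` is a name made up for this example:

```python
def dimensions_to_dict(dimensions):
    """Turn [{"name": ..., "value": ...}, ...] into {name: value, ...}."""
    return {item["name"]: item["value"] for item in dimensions}


# same shape as the Trigger's Dimensions field above, with a placeholder id
dimensions = [{"value": "<aws_id>", "name": "InstanceId"}]
by_name = dimensions_to_dict(dimensions)

# .get() avoids a KeyError for Alarms that carry no InstanceId dimension
instance_id = by_name.get("InstanceId", "unknown")
```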

One more thing: the only info we get here about the source of the alarm is the InstanceId. As those are random strings, to me they don't mean anything. Maybe I can set up the Alarm so it also includes the instance's name[5], and maybe even the URL pointing to the metric's graph.

Finally, Teams' webhook also expects a JSON payload. I didn't delve much into what you can give it; I just used the title, text and themeColor fields. At least text can be written in MarkDown. You get such a webhook by going to the Team, clicking the "vertical ellipsis" icon, then "Connectors", adding a webhook and copying the URL from there. I copied @type and @context from an SNS-to-Lambda-to-Teams post.
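
For reference, a sketch of a payload built from just those fields. `build_card` and its default color are choices made for this example, and the `@type`/`@context` values are the copied ones; this is not meant to describe the full format Teams accepts.

```python
import json


def build_card(title, text, color="4200c5"):
    """Build the JSON-serializable dict to POST to the Teams webhook."""
    return {
        "@type": "MessageCard",
        "@context": "http://schema.org/extensions",
        "themeColor": color,   # hex color, e.g. FF0000 for red
        "title": title,
        "text": text,          # may contain MarkDown
    }


card = build_card("CPUUtilization ALARM", "**foo** crossed its threshold", "FF0000")
body = json.dumps(card)
# sending it would be: requests.post(webhook_url, json=card)
```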

So to build a bridge from CloudWatch Alerts through SNS to an MSTeams Team, we just need a fairly straightforward script. I decided to write it in Flask, but I'm pretty sure writing it with plain http.server and urllib.request to avoid dependencies is not much more work; I just didn't want to do it. Maybe I should have tried FastAPI instead; I simply forgot about it.
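
For the curious, a rough sketch of what the dependency-free variant could look like. This is not the script actually in use: the webhook URL is a placeholder, the port is arbitrary, `handle_sns` and the other names are made up, and the alarm formatting is left out (the Message is forwarded as-is).

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# placeholder; the real URL comes from the Team's Connectors dialog
TEAMS_WEBHOOK = "https://example.invalid/webhook"


def handle_sns(payload, fetch, post):
    """Dispatch one decoded SNS payload and return the text to answer with.

    `fetch(url)` and `post(url, card)` do the network calls; passing them
    in keeps this function usable without a network.
    """
    kind = payload.get("Type")
    if kind == "SubscriptionConfirmation":
        fetch(payload["SubscribeURL"])
        return "confirmed"
    if kind == "Notification":
        card = {
            "@type": "MessageCard",
            "@context": "http://schema.org/extensions",
            "title": payload.get("Subject") or "SNS notification",
            "text": payload["Message"],
        }
        post(TEAMS_WEBHOOK, card)
        return "OK"
    return "ignored"


def http_get(url):
    with urllib.request.urlopen(url) as response:
        return response.read()


def http_post_json(url, data):
    # a Request with a body is sent as a POST
    request = urllib.request.Request(
        url,
        data=json.dumps(data).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


class SNSHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        answer = handle_sns(payload, http_get, http_post_json).encode()

        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(answer)))
        self.end_headers()
        self.wfile.write(answer)


def serve(port=8080):
    """Blocks forever; call it from your own entry point."""
    HTTPServer(("", port), SNSHandler).serve_forever()
```

Keeping the dispatch in a plain function means it can be exercised with fake `fetch` and `post` callables before SNS ever talks to it.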

Without further ado, here's the script. I'm running Python 3.8, so I don't have match/case yet.

#! /usr/bin/env python3

from flask import Flask, request
import json
import requests

app = Flask(__name__)

@app.route('/', methods=[ 'POST' ])
def root():
    print(f"{request.data=}")

    request_data = json.loads(request.data)

    # python3.8, no match/case yet
    message_type = request_data['Type']

    if message_type == 'SubscriptionConfirmation':
        response = requests.get(request_data['SubscribeURL'])
        print(response.text)

        return f"hello {request_data['TopicArn']}!"

    message = {
        '@type': 'MessageCard',
        '@context': 'http://schema.org/extensions',
        'themeColor': '4200c5',
    }

    if message_type == 'Notification':
        try:
            alarm = json.loads(request_data['Message'])
        except json.JSONDecodeError:
            # Subject is optional in SNS notifications, so don't assume it's there
            message['title'] = request_data.get('Subject', 'SNS notification')
            message['text']  = request_data['Message']
        else:
            instance_id = alarm['Trigger']['Dimensions'][0]['value']
            state = alarm['NewStateValue']

            if state == 'ALARM':
                color = 'FF0000'
            else:
                color = '00FF00'

            message['title'] = f"{instance_id}: {alarm['Trigger']['MetricName']} {state}"
            message['text']  = f"""{alarm['AlarmName']}

{alarm['Trigger']['MetricName']} {alarm['Trigger']['ComparisonOperator']} {alarm['Trigger']['Threshold']}
for {int(alarm['Trigger']['Period']) // 60} minutes.

{alarm['AlarmDescription']}

{alarm['NewStateReason']}

for {instance_id} passed to {state} at {alarm['StateChangeTime']}."""
            message['themeColor'] = color

        response = requests.post('https://<company>.webhook.office.com/webhookb2/<uuid1>@<uuid2>/IncomingWebhook/<id>/<uuid3>', json=message)
        print(response.text)

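        return "OK"

    # other message types (e.g. UnsubscribeConfirmation) are accepted but ignored;
    # without this return the view would return None and Flask would answer with an error
    return "ignored"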

  1. Again, I'm new to AWS. This is how it's set up at $NEW_JOB, but there might be better ways. If there are, I'm happy to hear them.

  2. 'we' as in me and my colleagues. 

  3. Don't get me started... 

  4. I know all this because right now I have like 5-8 unconfirmed Subscriptions because I had to figure all this out, mostly because I couldn't find sample data or, preferably, a tool that already does this. They're 5-8 because you can't create a second Subscription to the same URL, so I changed the port for every failed attempt to confirm the Subscription. 

  5. We don't have pets, but don't quite have cattle either. We have cows we name, and we get a little bit sad when we sell them, but we're happy when they invite us to the barbecue. 

  6. OK, I already started... 

  7. I added this footnote (I mean, the previous one... but this one too) while reviewing the post before publishing. Putting the correct number means editing the whole post, changing each number twice, which is error prone. In theory Nikola and/or MarkDown support auto-numbered footnotes, but I never managed to make it work. I used to have the same issue with the previous static blog/site compiler, ikiwiki, so this is not the first time I have out-of-order footnotes. In any case, I feel like they're a quirk that I find cute and somehow defining.