Taking more ownership with three simple metrics.

Metrics and monitoring: most developers have probably seen those big television screens, or glanced at the monitors in the room where the (dev)ops people sit. When something goes down, those people seem to know instantly what to do after staring at their screens for a few seconds.

Taking ownership of your application.

Although we are used to our (dev)ops colleagues having these tools, we have to face the fact that we as developers need these tools as well in order to take responsibility.

In my personal experience it happened way too often that I had to be notified by other people that my application was suddenly failing in production. It does not even have to be the application itself that is failing; maybe the application is triggering failures in services connected to yours. If you don’t have any metrics or monitoring available to you, it is hard to be accountable once your software is running in production. After all, you cannot see how your application behaves in a system with far more traffic than your development environment. An important step in taking ownership and responsibility is getting metrics / monitoring to your desk!

Simple steps to take

At Bynder we are transitioning fast to microservices, which are orchestrated by K8S (Kubernetes) and monitored by Prometheus. We use Grafana to turn the data from Prometheus into graphs and alerts. It’s unlikely that developers will keep these graphs open 24/7, so it would be nice if Grafana could alert developers when shit hits the fan. At Bynder we use Slack for internal communication, and Grafana supports Slack webhooks (and many other clients as well), so we can be notified through Slack when odd behaviour shows up in our services.
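As an illustration, a minimal sketch of how such a Slack channel could be wired up through Grafana’s provisioning files (legacy alerting) is shown below. The file path, channel name, uid and webhook URL are all placeholders, and newer Grafana versions configure this as a contact point instead:

# grafana/provisioning/notifiers/slack.yaml  (placeholder path)
apiVersion: 1

notifiers:
  - name: team-slack          # placeholder channel name
    type: slack
    uid: team-slack
    org_id: 1
    is_default: true
    settings:
      url: https://hooks.slack.com/services/...   # your incoming webhook URL
      recipient: "#your-team-channel"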

Step 1: Preventing OOM kills & keeping track of restarts

When K8S is hooked up to Prometheus, Prometheus collects data about the K8S pods; think of CPU and memory utilization. If a pod is not configured properly, it will get OOM killed. An OOM kill can be described as: “K8S killed the pod because it exceeded its memory limit”. Since the service is killed instantly, this can mean data loss in some cases. Without proper metrics you might not even know that (or why) this is happening, because K8S will automatically spawn a new pod. CLI tools like k9s might show the kill, but without detailed information. With metrics you can make these problems visible.
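If kube-state-metrics is scraped by your Prometheus (an assumption, not something described above), a hedged sketch of a rule that surfaces OOM kills directly could look like the following; the service name is a placeholder:

groups:
  - name: your-service-oom
    rules:
      - alert: ContainerOOMKilled
        # kube-state-metrics reports the reason of the last container termination
        expr: kube_pod_container_status_last_terminated_reason{container="your-service-name-here", reason="OOMKilled"} == 1
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} was OOM killed"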

K8S Memory Boundaries

Every K8S pod should be configured with an amount of CPU & memory that is guaranteed by K8S (the request) and a maximum amount of CPU & memory (the limit). This is easy to configure in the deployment file.

resources:
  requests:  # This must be guaranteed by K8S
    memory: "256Mi"
  limits:    # The maximum amount that K8S will allow
    memory: "1024Mi"

With this in mind we can easily implement our first two alerts:

  1. If memory usage > 256Mi (the request) for more than 30 minutes, send an alert to Slack
  2. If memory usage > 922Mi (90% of the limit), send an alert to Slack immediately

The alert for case 1 is less urgent, but it could be a sign that the requested memory is not configured properly, or that there is a memory leak somewhere in your application.

The alert for case 2 should be investigated immediately. You get this alert because you are consuming more than 90% of your maximum allowed resources; an OOM kill is waiting around the corner.

To feed those alerts we can use the following query in Prometheus / Grafana:

sum(container_memory_usage_bytes{pod=~"your-service-name-here-.*",cluster_name="$cluster_name"}) by (pod)
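If you prefer rule files over clicking alerts together in the Grafana UI, the same two thresholds could be sketched as Prometheus alerting rules roughly like this. The service name is a placeholder, and the Grafana variable $cluster_name is dropped because template variables do not exist in rule files:

groups:
  - name: your-service-memory
    rules:
      # Case 1: usage above the 256Mi request for 30 minutes, worth a look but not urgent
      - alert: MemoryAboveRequest
        expr: sum(container_memory_usage_bytes{pod=~"your-service-name-here-.*"}) by (pod) > 256 * 1024 * 1024
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.pod }} has been above its memory request for 30 minutes"
      # Case 2: usage above ~90% of the 1024Mi limit, an OOM kill is around the corner
      - alert: MemoryNearLimit
        expr: sum(container_memory_usage_bytes{pod=~"your-service-name-here-.*"}) by (pod) > 0.9 * 1024 * 1024 * 1024
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.pod }} is using more than 90% of its memory limit"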

K8S CPU Boundaries

Consuming too much CPU does not lead to OOM kills; instead, your service will get throttled. Setting alerts for CPU works roughly the same as for memory.

resources:
  requests:  # This must be guaranteed by K8S
    cpu: "250m"
  limits:    # The maximum amount that K8S will allow
    cpu: "1000m"

Again, with the above configuration in mind, we can easily do something like:

  1. If CPU usage > 250m (the request) for more than 30 minutes, send an alert to Slack
  2. If CPU usage > 900m (90% of the limit), send an alert to Slack immediately

In Prometheus we can simply do this:

irate(container_cpu_usage_seconds_total{container=~"your-service-name-here.*", cluster_name="$cluster_name"}[$__range])
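The same idea written as Prometheus alerting rules, again as a hedged sketch: the service name is a placeholder, the Grafana variables are replaced with a fixed 5m rate window, and 0.25 / 0.9 correspond to 250m and 900m of a CPU core:

groups:
  - name: your-service-cpu
    rules:
      # Case 1: sustained usage above the 250m request
      - alert: CpuAboveRequest
        expr: sum(rate(container_cpu_usage_seconds_total{container=~"your-service-name-here.*"}[5m])) by (pod) > 0.25
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.pod }} has been above its CPU request for 30 minutes"
      # Case 2: usage above 900m of the 1000m limit, throttling is imminent
      - alert: CpuNearLimit
        expr: sum(rate(container_cpu_usage_seconds_total{container=~"your-service-name-here.*"}[5m])) by (pod) > 0.9
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.pod }} is close to its CPU limit and will be throttled"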

With these two simple metrics in place you can already act proactively, before a (dev)ops engineer shows up at your desk. …But there is one more simple thing we can do!

Tracking restarts

Tracking restarts is a great indicator of whether your pod is behaving as expected. As a personal rule of thumb, I consider three or more restarts of a pod suspicious. It does not have to be bad news, but I definitely want to investigate it.

Note: a downside of this metric is that the alert cannot be resolved until the pod is recreated, because the restart counter only resets then. This is something to keep in mind!

Tracking restarts is fairly easy as well:

kube_pod_container_status_restarts_total{container="your-service-name-here", cluster_name="$cluster_name"}
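Turning the rule of thumb of three or more restarts into an alert could, as a rough sketch, look like this (the service name is a placeholder; the counter only resets when the pod is recreated, which is exactly the downside mentioned above):

groups:
  - name: your-service-restarts
    rules:
      # Three or more restarts is suspicious and worth investigating
      - alert: TooManyRestarts
        expr: kube_pod_container_status_restarts_total{container="your-service-name-here"} >= 3
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.pod }} has restarted {{ $value }} times"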

A small note on alerts

Being alerted is of course very nice: it allows you to respond proactively when shit hits the fan. What I typically advise is to bring this kind of alerting into your main team communication channel. Some people will argue that they find it super annoying, but that is the whole point! It’s a call to take action. If some alerts are not valuable to you, turn them off, because apparently they are not important. This is also why I would not recommend putting your alerts in a dedicated alerting channel: that channel will eventually be muted, which defeats the purpose of being alerted.

Conclusion

Taking ownership of your software is very important, and although we often don’t have (and shouldn’t have) access to the production network, we can get a good idea of how our application is behaving with a few simple metrics. You can do much more than measuring CPU & memory usage (like alerting on response times, status codes, etc.), but the three metrics above will inform you before your (dev)ops engineer does.