Metrics and monitoring: most developers have probably seen those big television screens, or glanced at the monitors in the room where all the (dev)ops people sit. When something goes down, those people seem to know exactly what to do after staring at their screens for a few seconds.
Taking ownership of your application.
We are used to our colleagues from (dev)ops having these tools, but we have to face the fact that we as developers need these tools as well in order to take responsibility.
In my personal experience it happened way too often that I had to be notified by other people that my application was suddenly failing in production. It does not even have to be the application itself that is failing; it may be triggering failures in services connected to yours. If you don’t have any metrics or monitoring available, it is hard to be accountable once your software is running in production. After all, you cannot see how your application behaves in a system with far more traffic than your development environment. An important step in taking ownership and responsibility is getting metrics and monitoring to your desk!
Simple steps to take
At Bynder we are rapidly transitioning to microservices, orchestrated by K8S (Kubernetes) and monitored by Prometheus. We use Grafana to turn the data from Prometheus into graphs and alerts. It’s unlikely that developers will have these graphs open 24/7, so it would be nice if Grafana could alert developers when shit hits the fan. At Bynder we use Slack for internal communication, and Grafana supports Slack webhooks (and many other clients as well), so we can be notified in Slack when odd behaviour is happening in our services.
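As an illustration, a Slack channel can be wired up through Grafana’s (legacy) notification-channel provisioning with a snippet like the one below. This is a sketch: the file path, channel name, uid, and webhook URL are all placeholders, not our actual setup.

```yaml
# Sketch of a Grafana legacy notification-channel provisioning file,
# e.g. provisioning/notifiers/slack.yaml (path and all values are placeholders).
notifiers:
  - name: team-slack          # hypothetical channel name
    type: slack
    uid: team-slack
    org_id: 1
    is_default: true
    settings:
      # Incoming-webhook URL generated in Slack; this one is a dummy.
      url: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX
      recipient: "#team-alerts"
```

Once a channel like this exists, any Grafana alert rule can point at it, so graph thresholds turn into Slack messages.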
Step 1: Preventing OOM kills & keeping track of restarts
When K8S is hooked up to Prometheus, Prometheus collects data about the K8S pods, such as CPU and memory utilization. If a pod is not configured properly, it will get OOM killed. An OOM (out-of-memory) kill can be described as: “K8S killed the pod because it exceeded the requested resource limits”. Since the service is killed instantly, this means you might have data loss in some cases. Without proper metrics you might not even know that (or why) this is happening, because K8S will automatically spawn a new pod. CLI tools like k9s might show the kill, but without detailed information. With metrics you can make those problems visible.
K8S Memory Boundaries
Every K8S pod should be configured with an amount of CPU & memory that is guaranteed by K8S, and a maximum amount of CPU & memory. This is typically easy to configure in the deployment file.
resources:
  requests: # This must be guaranteed by K8S
    memory: "256Mi"
  limits: # The maximum amount that K8S will allow
    memory: "1024Mi"
With this in mind we can easily implement our first two alerts:

- If requests.memory > 256Mi for more than 30 minutes, then send an alert to Slack
- If limits.memory > 922Mi, send an alert to Slack directly
The alert for case 1 is less urgent, but it can be a sign that the requested memory is not configured properly, or that there is a memory leak somewhere in your application.
The alert for case 2 should be investigated immediately. You get this alert because you are consuming more than 90% of your maximum allowed resources; an OOM kill is waiting around the corner.
To get those alerts we can use the following query in Prometheus / Grafana:
sum(container_memory_usage_bytes{pod=~"your-service-name-here-.*",cluster_name="$cluster_name"}) by (pod)
K8S CPU Boundaries
Consuming too much CPU power does not lead to OOM kills; instead, your service will get throttled. Setting alerts for CPU works roughly the same as setting memory restrictions.
resources:
  requests: # This must be guaranteed by K8S
    cpu: "250m"
  limits: # The maximum amount that K8S will allow
    cpu: "1000m"
Again, with the above configuration in mind we can easily do something like:

- If requests.cpu > 250m for more than 30 minutes, then send an alert to Slack
- If limits.cpu > 900m, send an alert to Slack directly
In Prometheus we can simply do this:
irate(container_cpu_usage_seconds_total{container=~"your-service-name-here.*", cluster_name="$cluster_name"}[$__range])
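Combining that query with the thresholds above, the CPU alerts could look like the following Prometheus alerting rules. This is a sketch: the names are illustrative, and a fixed 5m `rate` window replaces `$__range`, which is a Grafana dashboard variable and not available in rule files.

```yaml
groups:
  - name: cpu-alerts   # hypothetical group name
    rules:
      # Case 1: sustained usage above the requested 0.25 cores -> warning.
      - alert: CpuAboveRequests
        expr: rate(container_cpu_usage_seconds_total{container=~"your-service-name-here.*"}[5m]) > 0.25
        for: 30m
        labels:
          severity: warning
      # Case 2: usage above 0.9 cores, close to the 1000m limit -> throttling ahead.
      - alert: CpuNearLimit
        expr: rate(container_cpu_usage_seconds_total{container=~"your-service-name-here.*"}[5m]) > 0.9
        labels:
          severity: critical
```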
With these two simple metrics in place you can already act proactively, before a (dev)ops engineer shows up at your desk. …But there is one more simple thing we can do!
Tracking restarts
Tracking restarts is a great indicator of whether your pod is behaving as expected. As a personal rule of thumb, I consider three or more restarts of a pod suspicious. It does not have to be bad news, but I definitely do want to investigate it.
Note: a downside of this metric is that the alert cannot be turned off until a new pod is created; this is something to keep in mind!
Tracking restarts is fairly easy as well:
kube_pod_container_status_restarts_total{container="your-service-name-here", cluster_name="$cluster_name"}
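Following the rule of thumb above, the restart alert could be sketched as a Prometheus alerting rule like this (the name and label are illustrative, not part of our setup):

```yaml
groups:
  - name: restart-alerts   # hypothetical group name
    rules:
      # Fires once a container has restarted three or more times. As noted above,
      # the counter only resets when the pod is recreated, so the alert stays
      # active until then.
      - alert: TooManyRestarts
        expr: kube_pod_container_status_restarts_total{container="your-service-name-here"} >= 3
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} restarted {{ $value }} times"
```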
A small note on alerts
Being alerted is of course very nice; it allows you to respond proactively when shit hits the fan. What I would typically advise is to bring this kind of alerting into your main team communication channel. Some people will argue that they find it super annoying, but that is the whole point! It’s a call to take action! If some alerts are not valuable to you, turn them off, because apparently they are not important. This is also why I would not recommend putting your alerts in a dedicated alerting channel: that channel will eventually be muted, which defeats the purpose of being alerted.
Conclusion
Taking ownership of your software is very important, and although we often don’t have (and shouldn’t have) access to the production network, we can get a good idea of how our application is behaving with a few simple metrics. You can do far more than just measuring CPU & memory usage (such as alerting on response times, status codes, etc.), but the three metrics above will inform you before your (dev)ops engineer does.