Ensuring system reliability and high performance is fundamental to the DevOps methodology. This would not be possible without up-to-date monitoring tools that help track system availability and health. Implementing monitoring systems is therefore a key focus area for any site reliability engineer.

Detecting and analyzing system performance issues relies on data gathering. The collected metrics are key indicators sampled at regular intervals to track a system over time. They commonly include system throughput, the percentage of successful operations, the number of errors, and the performance metrics needed to observe system efficiency and availability over a period of time.
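
As a rough illustration of how such indicators can be derived from raw samples, here is a minimal Python sketch. The `Sample` structure, its field names, and the `summarize` helper are assumptions made for this example; they do not correspond to any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One collection interval (hypothetical structure for illustration)."""
    interval_seconds: float  # length of the collection interval
    succeeded: int           # operations that completed successfully
    failed: int              # operations that returned an error


def summarize(samples):
    """Aggregate samples into throughput, success percentage, and error count."""
    total_time = sum(s.interval_seconds for s in samples)
    succeeded = sum(s.succeeded for s in samples)
    failed = sum(s.failed for s in samples)
    total_ops = succeeded + failed
    return {
        # Operations handled per second across the whole observation window.
        "throughput_per_sec": total_ops / total_time if total_time else 0.0,
        # With no operations at all, report 100% rather than divide by zero.
        "success_pct": 100.0 * succeeded / total_ops if total_ops else 100.0,
        "error_count": failed,
    }


# Three one-minute samples with made-up numbers.
samples = [Sample(60, 590, 10), Sample(60, 600, 0), Sample(60, 570, 30)]
summary = summarize(samples)
print(summary)
```

In a real setup these values would normally come from the monitoring tool itself, so the sketch only shows the arithmetic behind the indicators.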

When we start working on a new project, we always discuss the basic metrics that monitoring needs to cover. This lets us tune the monitoring tools to the particular parameters that are crucial for the stability of the customer’s system. Once a special condition occurs or a value exceeds its threshold, a set of triggers activates the notification system and creates an alert of the corresponding priority. The priority level depends on many factors, such as the importance of the service, what the collected data actually means, and the consequences it may have. Alerts of the highest priority are critical and require immediate human intervention to prevent serious system failures. Lower-priority alerts are less urgent, meaning the problem can be addressed later. The notifications are then sent to recipients via the chosen communication channels (email, phone, Slack, etc.).
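
The trigger-and-priority flow described above could be sketched in Python roughly as follows. Everything here is an assumption for illustration: the `Rule` and `Priority` types, the threshold values, the metric names, and the channel mapping are hypothetical, and the code only selects channels rather than sending anything.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    LOW = 1
    HIGH = 2
    CRITICAL = 3


@dataclass
class Rule:
    metric: str
    threshold: float
    priority: Priority
    above: bool = True  # True: fire when value > threshold; False: when value < threshold


# Hypothetical routing: more urgent alerts reach more intrusive channels.
CHANNELS = {
    Priority.CRITICAL: ["phone", "slack", "email"],
    Priority.HIGH: ["slack", "email"],
    Priority.LOW: ["email"],
}


def evaluate(rules, metrics):
    """Return alerts for every breached rule, most urgent first."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # metric was not collected, nothing to evaluate
        breached = value > rule.threshold if rule.above else value < rule.threshold
        if breached:
            alerts.append({
                "metric": rule.metric,
                "value": value,
                "priority": rule.priority,
                "channels": CHANNELS[rule.priority],
            })
    return sorted(alerts, key=lambda a: a["priority"], reverse=True)


# Made-up rules and readings.
rules = [
    Rule("error_count", 25, Priority.HIGH),
    Rule("success_pct", 95.0, Priority.CRITICAL, above=False),
    Rule("cpu_pct", 80.0, Priority.LOW),
]
metrics = {"error_count": 40, "success_pct": 97.8, "cpu_pct": 91.0}

alerts = evaluate(rules, metrics)
for alert in alerts:
    print(alert["priority"].name, alert["metric"], "->", ", ".join(alert["channels"]))
```

With these sample readings the error-count and CPU rules fire while the success-rate rule does not, so the result contains one high-priority and one low-priority alert. In practice the thresholds and the priority-to-channel mapping are exactly what gets agreed with the customer at the start of a project.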

In our practice we use a stack of up-to-date solutions for monitoring and alert management, such as Zabbix, Nagios, Prometheus, ELK, Grafana, Datadog, Opsgenie, Pingdom, Mmonit, etc. While we are flexible enough to work with a variety of tools, the choice of technology is always determined by the customer’s needs and objectives. Our extensive focus on monitoring enables us to stay informed 24/7 and keep even the most complex customer systems running with zero downtime.