System reliability and high performance are the fundamentals of the DevOps methodology. Neither is achievable without up-to-date tools that help track the systems’ availability and health. Implementing monitoring systems therefore becomes a key focus area for any site reliability engineer.


Monitoring performance issues and analyzing them afterwards both rely on data gathering. The collected metrics are key indicators sampled at regular intervals so that a system can be observed over time. They commonly include the system throughput, the percentage of successful operations, the number of errors, and other performance metrics needed to assess the system’s efficiency and availability over a given period.
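
As a rough illustration of how such indicators can be derived from raw samples, here is a minimal Python sketch. The IntervalSample shape, the field names and the sample numbers are assumptions made for this example only; they do not describe any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class IntervalSample:
    """Counters collected during one sampling interval (hypothetical shape)."""
    duration_s: float  # length of the interval in seconds
    succeeded: int     # operations that completed successfully
    failed: int        # operations that returned an error


def summarize(samples):
    """Aggregate interval samples into throughput, success rate and error count."""
    total_time = sum(s.duration_s for s in samples)
    succeeded = sum(s.succeeded for s in samples)
    failed = sum(s.failed for s in samples)
    total_ops = succeeded + failed
    return {
        # operations per second across the whole window
        "throughput_ops_s": total_ops / total_time if total_time else 0.0,
        # share of operations that succeeded, in percent
        # (an empty window is reported as 100% by convention in this sketch)
        "success_pct": 100.0 * succeeded / total_ops if total_ops else 100.0,
        # absolute number of failed operations
        "error_count": failed,
    }


# Two made-up one-minute samples
window = [
    IntervalSample(duration_s=60, succeeded=590, failed=10),
    IntervalSample(duration_s=60, succeeded=400, failed=0),
]
summary = summarize(window)
print(summary)
```

In practice the counters would come from the monitoring agent or the application itself, and the aggregation window would be chosen to match the period being analyzed.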


Monitoring customers’ systems


When we start working on a new project, we always discuss with the customer the key performance metrics that indicate their system’s health. In other words, we identify the parameters whose changes are critical to consistent system performance. Once a defined condition occurs or a value exceeds its threshold, a trigger activates the notification system and creates an alert of the corresponding priority. The priority level depends on many factors, such as the importance of the service, what the obtained data actually means, and the consequences it may have. Alerts of the highest priority are critical and require immediate human intervention to prevent serious system failures. Lower-priority alerts indicate a problem that is less urgent. The notifications are then sent to recipients via the chosen communication channels (email, phone, Slack, etc.).
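
The flow from threshold to prioritized alert to notification channel can be sketched in a few lines of Python. This is only an illustration: the metric names, the thresholds, the two priority levels and the channel routing below are all hypothetical, and real projects would normally express such rules in the configuration of the chosen monitoring tool rather than in custom code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str       # name of the monitored parameter
    threshold: float  # alert when the value rises above this
    priority: str     # "critical" or "warning"


# Hypothetical routing: critical alerts also go out by phone
CHANNELS = {
    "critical": ["phone", "slack", "email"],
    "warning": ["slack", "email"],
}

# Hypothetical rules agreed with the customer
RULES = [
    Rule("error_rate_pct", 5.0, "critical"),
    Rule("error_rate_pct", 1.0, "warning"),
    Rule("disk_used_pct", 90.0, "warning"),
]

# Lower number means more severe
PRIORITY_ORDER = {"critical": 0, "warning": 1}


def evaluate(metrics, rules=RULES):
    """Return one alert per metric, using the most severe rule that is exceeded."""
    alerts = {}
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None or value <= rule.threshold:
            continue  # metric missing or still within its limit
        current = alerts.get(rule.metric)
        if (current is None
                or PRIORITY_ORDER[rule.priority] < PRIORITY_ORDER[current["priority"]]):
            alerts[rule.metric] = {
                "metric": rule.metric,
                "value": value,
                "priority": rule.priority,
                "channels": CHANNELS[rule.priority],
            }
    # Most severe alerts first
    return sorted(alerts.values(), key=lambda a: PRIORITY_ORDER[a["priority"]])


alerts = evaluate({"error_rate_pct": 7.2, "disk_used_pct": 93.0})
for alert in alerts:
    print(alert["priority"], alert["metric"], alert["value"], alert["channels"])
```

Keeping a single alert per metric, at the highest priority reached, avoids notifying the same recipients twice about one underlying problem.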


SHALB provides customer support and monitoring services 24/7. Our on-duty engineer keeps an eye on monitoring notifications and works through the queue of incoming tickets. We respond to monitoring notifications within 5 minutes. Critical issues are addressed immediately after the notification is received, and work on non-critical tasks starts within an hour.


In our practice we use a stack of up-to-date solutions for monitoring and alert management, such as Zabbix, Nagios, Prometheus, ELK, Grafana, Datadog, Opsgenie, Pingdom, Mmonit, etc. While we are flexible enough to work with a variety of tools, the choice of technology is always determined by the customer’s needs and objectives. Our strong focus on monitoring keeps us informed 24/7, ensuring zero downtime for even our customers’ most complex systems.