Real-time IT infrastructure audit

Is your infrastructure performant enough to meet your business challenges?

Our questionnaire will help you find out. Based upon DevOps best practices and metrics, this infrastructure audit checklist identifies all important aspects of a secure and resilient system and helps to discover bottlenecks. Use it to diagnose your infrastructure efficiency!

The infrastructure audit questionnaire consists of 12 questions. The first four represent DORA metrics – key parameters to measure software development and delivery performance as defined by Google’s DevOps Research and Assessment (DORA) team. The DORA metrics questions are based on Google’s Accelerate State of DevOps Report 2021.

The remaining questions relate to DevOps best practices and the level of their implementation in a company.

Upon completion, you will receive a detailed conclusion and recommendations from our DevOps experts on improving your infrastructure.

1. Deployment Frequency
2. Lead Time for Changes
3. Change Failure Rate
4. Mean Time to Recover
5. Infrastructure as Code
6. Containerization and Orchestration
7. Infrastructure Stack Modernity
8. CI/CD
9. Security
10. Backups and Disaster Recovery
11. Observability
12. Documentation

1. Deployment Frequency

This metric measures how often a company deploys code for a particular application, for example, once per week or once per month. The higher the deployment frequency, the faster changes and fixes reach your users.

How often does your company release new changes? Please choose one of the answers below.

Choose one answer

On-demand (multiple deployments per day)

Between once per day and once per week

Between once per week and once per month

Between once per month and once every six months
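As a rough illustration, a deployment history can be mapped onto the answer options above. This is a minimal sketch: the function name is hypothetical, and treating a week as 7 days and a month as 30 days is an assumption, not part of the DORA definitions.

```python
from datetime import date


def deployment_frequency_bucket(deploy_dates: list[date]) -> str:
    """Map a deployment history onto the questionnaire's answer options."""
    if len(deploy_dates) < 2:
        raise ValueError("need at least two deployments to measure frequency")
    ordered = sorted(deploy_dates)
    span_days = (ordered[-1] - ordered[0]).days
    # Average number of days between consecutive deployments.
    avg_gap = span_days / (len(ordered) - 1)
    if avg_gap < 1:
        return "On-demand (multiple deployments per day)"
    if avg_gap <= 7:  # assumed: one week = 7 days
        return "Between once per day and once per week"
    if avg_gap <= 30:  # assumed: one month = 30 days
        return "Between once per week and once per month"
    # Anything slower falls into the last option in this sketch.
    return "Between once per month and once every six months"
```

For example, three deployments on 1, 4, and 7 January are three days apart on average, which falls into the second option.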

2. Lead Time for Changes

This metric measures the time it takes for committed code to reach production. It indicates the velocity of delivery: the lower the value, the faster changes reach your users.

How long does it take in your company for code changes to reach production? Please choose one of the answers below.

Choose one answer

Less than one day

Between one day and one week

Between one week and one month

Between one month and six months
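If you record when each change was committed and when it reached production, the lead time can be estimated as sketched below. The function name and the input shape (pairs of commit and deploy timestamps) are assumptions made for illustration; the median is used here because a few unusually slow changes would skew an average.

```python
from datetime import datetime
from statistics import median


def median_lead_time_hours(changes: list[tuple[datetime, datetime]]) -> float:
    """Median hours between commit time and production deploy time."""
    hours = [
        (deployed - committed).total_seconds() / 3600
        for committed, deployed in changes
    ]
    return median(hours)
```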

3. Change Failure Rate

This metric captures the percentage of code changes that resulted in incidents, rollbacks, or any type of production failure. The Change Failure Rate indicates the quality of deployed software: the lower the rate, the fewer defective changes reach production.

How often do changes in your code lead to critical production issues? Please choose one of the answers below.

Choose one answer

0–15%

15–30%

30–60%
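The percentage itself is simple arithmetic: failed deployments divided by total deployments over the same period. A minimal sketch, with a hypothetical function name:

```python
def change_failure_rate(total_deployments: int, failed_deployments: int) -> float:
    """Percentage of deployments that caused a production failure."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    if not 0 <= failed_deployments <= total_deployments:
        raise ValueError("failed_deployments must be between 0 and the total")
    return 100 * failed_deployments / total_deployments
```

For instance, 5 failed deployments out of 40 give a rate of 12.5%, which falls into the first option.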

4. Mean Time to Recover

The Mean Time to Recover metric measures the average time required to troubleshoot a component or recover a system after failure. Effective DevOps reduces this metric.

How would you assess the Mean Time to Recover in your company? Please choose one of the answers below.

Choose one answer

Less than an hour

Less than one day

Between one week and one month
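Given the duration of each past outage, the average can be computed as in the sketch below. The function name is hypothetical, and the outage durations are assumed to come from your own incident records.

```python
from datetime import timedelta


def mean_time_to_recover(outages: list[timedelta]) -> timedelta:
    """Average time between a failure and the restoration of service."""
    if not outages:
        raise ValueError("no outages recorded")
    # sum() needs a timedelta start value to add durations together.
    return sum(outages, timedelta()) / len(outages)
```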

5. Infrastructure as Code

Infrastructure as code (IaC) is an approach to set up, provision, and deploy IT infrastructures by describing their resources in code. Implementing the IaC practice allows you to automate deployments, trace and validate infrastructure changes, and deploy environment configurations to create identical environments as often as needed.

Technologies: Terraform, Pulumi, AWS CloudFormation.

Do you follow the Infrastructure as Code approach? Please choose your answer. You can supplement your answer with an additional option (+) if it is relevant to your organization.

Choose one answer

All infrastructure resources are defined in the code

All infrastructure resources are defined in the code.

GitOps approach fully utilized

Git is a single source of truth for all infrastructure operations. The declared and actual infrastructure states are in full correspondence, and any divergences are reconciled automatically.

GitOps implemented partially, or changes automated via a CI/CD system

Automation of infrastructure deployments is triggered by changes in the Git repository, but reconciliation is controlled by the CI/CD system.

Most parts are in the code

Some infrastructure resources are defined in the code, but many tasks are still performed manually.

No parts of the infrastructure are codified

All infrastructure resources are configured manually.
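The idea of comparing the declared state with the actual state can be illustrated with a small sketch. It is not how Terraform or any GitOps tool works internally; the function name and the dictionary-based resource model are assumptions chosen to keep the example self-contained.

```python
def find_drift(
    declared: dict[str, dict], actual: dict[str, dict]
) -> dict[str, list[str]]:
    """Compare the state declared in Git with the state actually running."""
    return {
        # Declared in code but missing from the running infrastructure.
        "create": sorted(declared.keys() - actual.keys()),
        # Running but no longer declared in code.
        "delete": sorted(actual.keys() - declared.keys()),
        # Present on both sides but configured differently.
        "update": sorted(
            name
            for name in declared.keys() & actual.keys()
            if declared[name] != actual[name]
        ),
    }
```

A reconciliation step would then apply the listed actions until all three lists are empty.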

6. Containerization and Orchestration

Containerization is the practice of packaging application code with all its related files, libraries, and dependencies within a standardized unit, or ‘container.’ Once workloads are containerized, they can run on any platform, be independent of one another in terms of languages or frameworks, and be managed collectively with container orchestration tools.

Technologies: Docker, Kubernetes, Rancher, Docker Swarm, OpenShift, EKS, and AWS Fargate.

Do you leverage the advantages of containerization technology? Please choose one of the answers below.

Choose one answer

Complete cloud-native architecture

Applications are natively designed to utilize cloud services and maximize their cloud-native potential, including the use of microservices for application architecture, portability of containerized apps, and facilitated CI/CD efforts.

Container orchestration implemented

Container orchestration is implemented, but the services were not initially designed to be containerized.

Applications are containerized

Applications are packaged within container units and isolated from one another in terms of operation, configuration, and debugging.

Application installed to VMs by scripts or manually

The application is built as a single monolithic unit. The program’s components are tightly coupled and must all be present for the software to run.

7. Infrastructure Stack Modernity

This measure indicates how closely your infrastructure corresponds to the latest technology trends and whether it is ready for future challenges. Regular technology updates allow companies to remain technologically advanced and ahead of the competition.

Technologies: vary depending on the level.

Is your infrastructure stack modern enough? Please choose one of the answers below.

Choose one answer

Most common technologies are regularly updated

A high level of stability that implies using the latest, most cost-effective practices and tools, e.g., declarative approach, containerization, Terraform, and Kubernetes. Technology updates are regular and follow official releases, i.e., it takes no longer than six months after the official release for the newest version to run in production.

Most common technologies with outdated versions

A high level of stability that implies using the most popular and cost-effective practices and tools, e.g., declarative approach, containerization, Terraform, and Kubernetes. Technology updates are not regular, meaning the gap between the official release and the newest version running in production is over six months.

Slightly outdated technology stack

This level implies using tried-and-tested technologies that are somewhat outdated in light of today’s technological advancements. Examples: imperative approach infrastructure tools like Ansible and Salt.

Largely outdated technology stack

The technologies in use are largely outdated, and many are no longer maintained. Examples: virtual machine technology, Chef, Puppet, and infrastructure running on Bash scripts.

8. CI/CD

Continuous Integration (CI) and Continuous Deployment (CD) are the DevOps practices of automated building, testing, and deployment of code to target environments. Implementation of CI/CD enables automation of repetitive tasks, provides for a faster deployment pace, shorter release cycles, early detection of erroneous code and quick fixes, and improves overall code quality.

Technologies: GitLab, GitHub, Argo CD, Bitbucket, and Jenkins.

Which of the CI/CD processes are established in your company? Please choose one or more answers.

Multiple choice

Progressive delivery and feature flags

The next logical step of continuous delivery. The approach is defined by feature flagging during deployments, gradual rollouts, canary launches, blue-green deployments, A/B testing, and so on.

Immutable promotion of artifacts between environments

Artifacts are deployed to one environment at a time and immutably promoted to the next after testing.

Code reviews

All proposed code changes are reviewed before they are applied. Pull requests are an easy way to do this.

Decoupled CI and CD

A trend in modern DevOps is to decouple CI and CD workflows by using separate tools for each. For example, CI runs on the Git provider's native tools (GitLab pipelines or GitHub Actions), while CD is handled by a pull-based model from the cluster itself via GitOps tools (Argo CD, Flux).

Infrastructure testing

All infrastructure is defined in code, and automated tests are applied to verify changes after every single commit.
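The gradual rollouts mentioned under progressive delivery are often implemented by assigning each user to a stable bucket and enabling the feature only for buckets below the current rollout percentage. The sketch below is one common way to do this and is not tied to any particular feature flag product; the function name and the flag and user identifiers are hypothetical.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in or out of a gradual rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < rollout_percent
```

Because the bucket depends only on the flag and user identifiers, a user who sees the feature at 10% continues to see it as the rollout grows to 50% and 100%.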

9. Security

Security is a set of specific guidelines and best practices to protect information, systems, and assets against potential attacks. Effective security strategy mitigates the risks of your data assets being compromised, prevents security breaches and data leakage, and enhances the overall reliability and availability of services.

Techniques and technologies: Threat modeling, risk assessment, Defense in Depth (DiD) approach, security by design principles, and Application Security (AppSec) tools.

Which of the following security practices are implemented in your company? Please choose one or more answers.

Multiple choice

Supply chain security established

All open-source and third-party components are tested for potential security issues before use.

DevSecOps

An application security practice that involves introducing security in the early stages of the software development lifecycle rather than at the end when identified vulnerabilities are more costly to fix.

In-house Security Officer

A dedicated position within the company responsible for information security, cybersecurity, and IT risk management programs.

Proper secrets management

Since secrets contain private and sensitive information, they should never be stored in plaintext. Using secure secret managers like 1Password or LastPass and secret stores such as AWS Secrets Manager, SSM Parameter Store, or HashiCorp Vault helps to protect your sensitive data against cyber thieves.

Risk assessment

A practice of identifying potential cyber threats that could disrupt business, analyzing their consequences, and designing countermeasures.

Regular infrastructure security audits

Going through security audits and having a third-party company conduct penetration testing on your services is an effective way to identify and fix potential issues.

Active external security

A set of measures for protecting websites, services, and networks against possible external attacks; includes the setup of web firewalls, DDoS protection, and an intrusion detection system (IDS).

Implementation of SAST and DAST at the CI/CD level

Static Application Security Testing (SAST) and Dynamic Application Security Testing (DAST) are included as part of the CI/CD pipeline.

10. Backups and Disaster Recovery

Backup and Disaster Recovery (DR) are DevOps strategies for restoring infrastructure or system components after a failure with minimum downtime and data loss. Effective backup and recovery strategies imply redundancy of information and data assets, so you will always have a copy of your data available elsewhere when a disaster strikes.

Technologies: dedicated backup software (Veeam Backup & Replication, Velero, Rsnapshot, FSBackup), snapshots, and BaaS for cloud services.

How well is your business protected against force majeure? Please choose one or more answers.

Multiple choice

DR plan regularly tested

A DR plan has been carefully designed and proved to work by running disaster scenarios and checking the restoration procedures in practice.

Backups are tested

Recovery scenarios are automatically tested by periodically restoring data from created backups to ensure they work.

Infrastructure recovery from code

Though not a backup strategy per se, IaC could be considered a kind of infrastructure backup as it allows for the quick restoration of infrastructure from available code or the configuration of identical environments using the same code.

Backups and snapshots for databases configured

Both server backup and snapshot options are effectively implemented to secure your datasets.

Pilot Light recovery strategy implemented

A DR strategy in which core system functionality is kept configured and running in the cloud or in a separate cloud account. When recovery is needed, you can rapidly provision a full-scale production environment around this critical core.
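Testing a backup means restoring it and checking the result, not just confirming that a backup file exists. The sketch below illustrates the idea for a single file by comparing checksums after a restore. The function names are hypothetical, and the plain file copy is only a stand-in for whatever restore procedure your backup tool provides.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def restore_matches_source(source: Path, backup: Path) -> bool:
    """Restore a backup into a scratch directory and compare checksums."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / source.name
        shutil.copy2(backup, restored)  # stand-in for a real restore step
        return sha256_of(restored) == sha256_of(source)
```

Running a check like this on a schedule would surface a corrupted or incomplete backup before it is needed.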

11. Observability

Observability is a DevOps practice of measuring a system’s current state based on the data it generates, including logs, metrics, and traces. Observability makes infrastructure processes visible, allows for data visualization and analysis, and enables effective code debugging and timely troubleshooting of issues.

Technologies: Grafana, Prometheus, Alertmanager, ELK, AWS CloudWatch, Jaeger, and Datadog.

How effective is observability in your company? Please choose one or more answers.

Multiple choice

Full coverage of logs and metrics

All critical parameters – including availability metrics, business metrics, application metrics, and server metrics – are being tracked, covered with monitoring, and recorded in logs.

Centralized observability system

A system that aggregates the data produced by all IT systems in one place and allows it to be managed and processed through a single pane of glass.

Alerts and notifications configured

Alerts notify on-call engineers when critical metrics cross pre-defined thresholds. Most metrics and log tools support alerting and can be integrated with notification tools.

Postmortem analysis procedures are held

Postmortem analysis includes a detailed record of an incident and its subsequent investigation, identifying the root cause and the preventive measures needed to keep it from recurring.

SLA/SLO requirements are set

An SLA (service level agreement) is an agreement between a provider and client about measurable metrics like uptime, responsiveness, and responsibilities. An SLO (service level objective) is an agreement within an SLA about a specific metric like uptime or response time.
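An availability SLO translates directly into a downtime allowance, often called an error budget. The sketch below shows the arithmetic; the function name and the 30-day default window are assumptions for illustration.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits within a time window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100
```

For example, a 99.9% uptime objective over 30 days leaves roughly 43 minutes of permitted downtime.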

12. Documentation

Documentation is an effective way to keep internal processes and procedures systemized and available for future reference. Detailed and accurate documentation is a centerpiece of all your must-know information and an advisory for new employees.

Technologies: documentation management systems (Confluence, Nuclino, Read the Docs), GitHub Pages, and Continuous Documentation tools.

How good is your documentation? Please choose one or more answers.

Multiple choice

Documentation precedes production changes

The highest level: you first document the desired system and then follow that documentation when building it.

Detailed infrastructure documentation and diagrams

Your project documentation includes detailed infrastructure descriptions updated on demand, including IP addresses, physical locations, dependencies, and passwords. Moreover, elements of system design and all connections between them are visualized on architecture diagrams.

Centralized documentation

Internal documentation is centralized and regularly updated; onboarding documentation exists for newcomers to make a smooth start.