Many devops organizations implement advanced CI/CD, infrastructure as code, and other automations to increase deployment frequency. In the State of DevOps Report 2023, 18% of respondents were classified as elite performers because they could deploy on demand with change lead times of less than a day.

However, these elite performers also report a 5% change failure rate, which may be acceptable for applications that aren’t mission-critical or for deployments triggered during low-usage periods. But when change failures occur in airline, banking, and other applications that require 5-nines availability (99.999% uptime, or roughly five minutes of downtime per year), pushing defects and problematic configuration changes to production can lead to deployment horrors.

CrowdStrike recently made headlines with a failed deployment impacting 8.5 million Microsoft Windows computers, causing nearly 10,000 flight cancellations worldwide. Needless to say, the failure created a significant financial impact. CrowdStrike’s root cause analysis classified the issue as a bug, reporting, “The sensor expected 20 input fields, while the update provided 21 input fields. In this instance, the mismatch resulted in an out-of-bounds memory read, causing a system crash.”

It might be time for devops organizations to rethink their deployment strategies. When are frequent releases too risky, and how should teams evaluate the risks of a change to avoid large-scale deployment issues?

Evaluate requirements and implementation risks

Not all releases, features, and agile user stories carry equal deployment risk. Many organizations automate the creation of deployment risk scores and use them to decide what level of testing and operational review is required before a release. Traditionally, risk scoring relied on subjective inputs, asking experts to evaluate each risk’s probability and impact, but organizations that deploy frequently can move to a machine learning-driven approach.

“Avoiding deployment horrors starts in the planning phase,” says David Brooks, SVP of evangelism at Copado. “AI can help evaluate user stories to identify ambiguities, hidden dependencies and impacts, and overlapping work by other developers.”

Release management strategies have long characterized deployments as part of internal communications and risk management frameworks. A traditional approach distinguishes among major upgrades, minor improvements, and underlying system upgrades, and devops leaders then specify deployment policies, risk mitigation requirements, and automation rules based on release type.

A more data-driven approach characterizes releases and computes risk scores from many more variables, such as the number of users impacted, test coverage of the changed code, and measures of dependency complexity. Organizations can then implement feedback loops that calibrate risk scores against releases’ actual business impacts by capturing outages, performance issues, security incidents, and end-user feedback.
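
To make this concrete, here is a minimal Python sketch of such a risk score. The input variables, weights, and normalization caps are hypothetical; a real implementation would calibrate them against the organization’s own incident history through the feedback loop described above.

```python
from dataclasses import dataclass

@dataclass
class ReleaseMetrics:
    """Hypothetical inputs to a deployment risk score."""
    users_impacted: int       # end users exposed to the change
    test_coverage: float      # fraction of changed code covered by tests (0-1)
    dependency_count: int     # services and libraries the change touches
    past_failure_rate: float  # historical change failure rate for this component (0-1)

# Illustrative weights; a real implementation would fit these against
# captured outages, incidents, and end-user feedback.
WEIGHTS = {"users": 0.4, "coverage": 0.3, "dependencies": 0.2, "history": 0.1}

def risk_score(m: ReleaseMetrics, max_users: int = 1_000_000) -> float:
    """Combine normalized inputs into a 0-1 score; higher means riskier."""
    user_risk = min(m.users_impacted / max_users, 1.0)
    coverage_risk = 1.0 - m.test_coverage
    dependency_risk = min(m.dependency_count / 50, 1.0)  # treat 50+ dependencies as max risk
    return (WEIGHTS["users"] * user_risk
            + WEIGHTS["coverage"] * coverage_risk
            + WEIGHTS["dependencies"] * dependency_risk
            + WEIGHTS["history"] * m.past_failure_rate)

# A release touching 250,000 users with 85% coverage and 12 dependencies:
print(f"risk score: {risk_score(ReleaseMetrics(250_000, 0.85, 12, 0.05)):.2f}")
```

Releases scoring above an agreed threshold can then be routed to heavier testing and operational review before deployment.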

Embed security into the developer experience

Finding security issues post-deployment is a major risk, so many devops teams shift security left by instituting devops security non-negotiables. These are a mix of policies, controls, automations, and tools, but most importantly, they make security a top-of-mind responsibility for developers.
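
As one illustration of such a non-negotiable, here is a minimal Python sketch of a pre-deployment gate that fails a pipeline when a security scan reports high-severity findings. The report format, severity labels, and threshold are assumptions rather than any particular scanner’s output.

```python
import json
import sys

# Hypothetical policy: block the pipeline on any high-severity finding.
BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}

def gate(report_path: str) -> int:
    """Return a CI exit code: 0 allows the deployment, 1 blocks it."""
    with open(report_path) as report:
        findings = json.load(report)  # assume a list of {"id", "severity", "title"} dicts
    blockers = [f for f in findings if f.get("severity") in BLOCKING_SEVERITIES]
    for finding in blockers:
        # Give the developer clear feedback on why the deployment is blocked.
        print(f"BLOCKED by {finding['id']} ({finding['severity']}): {finding['title']}")
    return 1 if blockers else 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))  # e.g., python gate.py scan-report.json
```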

“Integrating security and quality controls as early as possible in the software development lifecycle is absolutely necessary for a functioning modern devops practice,” says Christopher Hendrich, associate CTO of SADA. “Creating a developer platform with automation, AI-powered services, and clear feedback on why something is deemed insecure and how to fix it helps the developer to focus on developing while simultaneously strengthening the security mindset.”

Devops teams should consider the following practices to minimize the risk of deployment disasters:

Implement continuous deployment prerequisites

Development teams champion the objective of automating a path to production, but not all devops teams are truly ready for continuous deployment. What starts as a straightforward goal of implementing CI/CD in production environments can lead to deployment horrors if the right safeguards aren’t in place, especially as applications and microservices architectures grow more complex.

“Software development is a complex process that gets increasingly challenging as the software’s functionality changes or ages over time,” says Melissa McKay, head of developer relations at JFrog. “Implementing a multilayered, end-to-end approach has become essential to ensure security and quality are prioritized from initial package curation and coding to runtime monitoring.”

Devops teams looking to implement continuous delivery on mission-critical, large-scale applications should institute the following best practices:

  • Continuous testing with high test coverage, comprehensive test data, and end-user persona-driven scenarios, including synthetic data and genAI testing capabilities, to minimize defects in production code.
  • Feature flagging so that development teams can control experimental capabilities and configure and test them with a targeted set of end users, as sketched in the example after this list.
  • Canary release strategies to support deploying multiple versions of an application or service, control which versions end users access, and catch issues while they affect only this smaller end-user base.
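
To make the feature flagging and canary concepts concrete, here is a minimal Python sketch of a percentage-based rollout. The flag names, rollout percentages, and in-memory store are hypothetical stand-ins for a real feature-management service.

```python
import hashlib

# Hypothetical in-memory flag store; real teams would use a feature-management
# service so rollout percentages can change without redeploying.
FLAGS = {
    "new-checkout-flow": 5,  # canary: 5% of users get the new code path
    "search-rewrite": 100,   # fully rolled out
}

def bucket(flag: str, user_id: str) -> int:
    """Deterministically map a user to a bucket from 0 to 99 for a given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_enabled(flag: str, user_id: str) -> bool:
    """A user sees the feature only if their bucket falls under the rollout %."""
    return bucket(flag, user_id) < FLAGS.get(flag, 0)

# The same user always lands in the same bucket, so the canary cohort is
# stable and issues surface while they affect only that smaller group.
if is_enabled("new-checkout-flow", "user-1234"):
    print("serving new checkout flow")
else:
    print("serving existing checkout flow")
```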

“Implementing software checks at every stage enhances code quality and resiliency while utilizing strategies such as canary testing helps guarantee stable deployments,” adds McKay of JFrog.

Some organizations have only a few large-scale, mission-critical applications that require continuous delivery and all its prerequisites. Large enterprises with a significant portfolio of mission-critical applications, data science teams managing hundreds of AI models, or SaaS companies with several product lines should consider platform engineering practices to drive standards and efficiencies.

Kevin Cochrane, CMO of Vultr, says, “Platform engineering for cloud-native and AI-native applications enhances devops and lowers enterprises’ risk by automating key processes supporting mature and responsible AI, including infrastructure provisioning, model observability, and data governance.”

Continuously improve observability, monitoring, and AIOps

What separates a bad deployment from a deployment horror story comes down to three key questions:

  • What is the business impact, including the number of end users affected, the lost productivity in operations, the financial costs, reputation damage, compliance factors, and legal implications?
  • How long does it take for the organization to identify the issue, communicate the problem to end users, recover from the incident, identify root causes, and implement remediations?
  • How often is the devops team responding to bad deployments, and are the engineers burned out from all the firefighting?

Observability, application monitoring, and AIOps are key operational capabilities that can reduce the business impact of major incidents and improve the mean time to recovery (MTTR).
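
As a simple worked example, MTTR can be computed directly from incident detection and recovery timestamps; the incident data below is hypothetical.

```python
from datetime import datetime

# Hypothetical incident log: (detected, recovered) timestamps per incident.
incidents = [
    (datetime(2024, 3, 1, 9, 14), datetime(2024, 3, 1, 10, 2)),
    (datetime(2024, 4, 17, 22, 40), datetime(2024, 4, 18, 0, 5)),
    (datetime(2024, 6, 9, 13, 5), datetime(2024, 6, 9, 13, 33)),
]

# MTTR averages the detection-to-recovery time across incidents.
minutes = [(end - start).total_seconds() / 60 for start, end in incidents]
mttr = sum(minutes) / len(minutes)
print(f"MTTR: {mttr:.0f} minutes across {len(incidents)} incidents")
```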

“All of today’s devops deployment ‘horrors’—from end-user errors to potential miscommunications—ultimately boil down to a lack of proper communication or visibility,” says Madhu Kochar, VP of product development at IBM Automation. “You can’t fix what you can’t see, and that’s exactly why observability, especially within the context of intelligent automation, is critical for addressing known flaws and providing insight into what’s happening inside your system or application. Observability allows the devops feedback loop to flow without interruption for efficient, better-performing deployment efforts that catch issues before they affect end users.”

Observability, automation, and monitoring are key defensive strategies that alert network operation centers (NOCs) and site reliability engineers (SREs) to major incidents. Organizations running microservices architectures, deploying to multiple clouds, and connecting to many third-party systems need AIOps solutions to identify incident root causes and trigger automated responses such as deployment rollbacks.
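
Here is a simplified Python sketch of one such automated response: a rollback triggered when a canary’s error rate spikes above its baseline. The thresholds, metric windows, and rollback hook are assumptions; a production AIOps pipeline would draw on far richer signals.

```python
from statistics import mean

# Hypothetical thresholds: roll back when the canary's error rate is both
# above an absolute floor and well above the stable version's baseline.
ABSOLUTE_ERROR_FLOOR = 0.02  # 2% of requests failing
BASELINE_MULTIPLIER = 3.0

def should_roll_back(canary_rates: list[float], baseline_rates: list[float]) -> bool:
    """Compare recent canary error rates against the stable baseline."""
    canary, baseline = mean(canary_rates), mean(baseline_rates)
    return canary > ABSOLUTE_ERROR_FLOOR and canary > baseline * BASELINE_MULTIPLIER

def roll_back(service: str, version: str) -> None:
    # Stand-in for the real hook: re-point traffic at the prior release via
    # the deployment platform's API and page the on-call SRE.
    print(f"rolling back {service} to {version} and alerting on-call")

# Error rates from the last five scrape intervals (hypothetical data).
if should_roll_back([0.04, 0.05, 0.06, 0.05, 0.07], [0.005, 0.004, 0.006, 0.005, 0.005]):
    roll_back("checkout-service", "v1.41.2")
```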

“As the digital landscape rapidly evolves, traditional monitoring tools are no longer sufficient to meet the demands of modern devops,” says Jamesraj Paul Jasper, principal product manager at ManageEngine. “Teams must adopt a system of intelligence, including AI-driven observability and predictive solutions, to stay ahead of potential issues.”

Develop a major incident playbook

When a bad deployment causes a major incident, IT organizations must have an operational playbook to guide their response. Is there an optimally sized and skilled response team that knows what communication tools they’ll use and what application monitors to review? Does the team have a leader to coordinate actions, and are stakeholders, executives, and customers alerted about the issue and well-informed about its status?

During a major incident, the worst-case scenario is watching everyone from engineers to executives run around like chickens with their heads cut off, with no clear direction or communication. That creates delays, missteps, and greater stress. Most organizations develop an IT service management (ITSM) major incident playbook and process to prepare for all types of business-impacting production issues.

Devops teams are compelled to increase deployment frequency and deliver new capabilities to application end users. Every deployment carries risk, and bad deployments will require root cause analysis and remediation. However, smarter devops organizations balance speed with preparedness to avoid deployment horrors.