Cloudflare Wraps Up 'Fail Small' Initiative: A Stronger, More Resilient Network
Over the past two-plus quarters, Cloudflare’s engineering teams have been immersed in an intensive internal project known as “Code Orange: Fail Small.” The goal was to fortify the company’s infrastructure so it can better withstand failures, protect customer traffic, and recover faster from incidents. Earlier this month, the final work packages were completed, marking a major milestone, though the team emphasizes that reliability is an ongoing journey, not a finish line. The initiative directly addresses the root causes of two global outages that occurred on November 18 and December 5, 2025, and is designed to keep those specific failure modes from recurring.
Key Areas of Focus
The project concentrated on four main pillars: safer configuration changes, reduced failure impact, revised break-glass procedures, and improved incident management. In addition, new safeguards were introduced to prevent configuration drift and regressions over time, and customer communication during outages was strengthened.

Safer Configuration Changes with Snapstone
One of the most significant changes is how Cloudflare handles configuration deployments. Previously, internal configuration changes could propagate across the entire network instantly. That posed a risk: if a change contained a flaw, it could affect all traffic before anyone noticed. Now, a new system called Snapstone brings health-mediated deployment to configuration updates—the same proven methodology used for software releases.
Snapstone works by bundling configuration changes into packages. These packages are then rolled out gradually across the network. During the rollout, real-time health monitoring continuously checks for anomalies. If a problem is detected, the change is automatically rolled back, often before any customer impact occurs. Previously, this approach was applied inconsistently: each product team had to build its own mechanism, which left gaps. Snapstone closes those gaps by providing a unified, default system for all high-risk configuration pipelines.
The system is flexible by design. It is not limited to the specific data file or control flag that caused the past outages. Teams can define any unit of configuration that needs mediation—whether it’s a routing table, a firewall rule, or a global feature flag. This ensures that even future, unforeseen configuration types are subject to the same safety net.
Reducing Failure Impact and Streamlining Incident Response
Beyond configuration safety, the team re-engineered how failures affect the network. The “Fail Small” philosophy means containing the blast radius of any single issue. For example, if a particular service fails, the rest of the network should continue operating normally. This involved redesigning internal architectures to isolate failure domains, adding circuit breakers, and improving load shedding mechanisms.
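As an illustration of one of these techniques, a circuit breaker can be sketched in a few lines of Python. This is a generic, minimal version of the pattern and not Cloudflare's implementation; the class name, the failure limit, and the cool-down period are all assumptions chosen for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker: stop calling a dependency after repeated failures."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds before a retry
        self._clock = clock               # injectable so tests can fake time
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Open: fail fast so the failing dependency cannot drag
                # down its callers.
                raise CircuitOpenError("dependency marked unhealthy; call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure now re-opens the breaker immediately.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success resets the failure count
        return result
```

Wrapping calls to a dependency as `breaker.call(lambda: fetch())` (where `fetch` is whatever function talks to that dependency) keeps a failing service from consuming its callers' resources, which is one way to keep the blast radius of a failure small.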
Incident management processes were also overhauled. The break-glass procedures (emergency access overrides used during critical outages) have been made clearer and safer. The team introduced stricter verification steps and temporary access tokens with automated expiration, reducing the risk of accidental long-term exposure. Additionally, post-incident reviews now feed directly into the development cycle, ensuring that lessons learned become permanent improvements.
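The idea of temporary access tokens with automated expiration can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of Cloudflare's tooling: the `BreakGlassTokens` class, the 15-minute default lifetime, and the rule that a second person must approve the request are all hypothetical.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Grant:
    operator: str
    expires_at: float  # absolute time after which the token is rejected


class BreakGlassTokens:
    """Hypothetical in-memory issuer of short-lived emergency access tokens."""

    def __init__(self, ttl_seconds: float = 900.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can fake time
        self._grants: Dict[str, Grant] = {}

    def issue(self, operator: str, approved_by: str) -> str:
        # Assumed verification step: operators cannot approve themselves.
        if not approved_by or approved_by == operator:
            raise PermissionError("a second person must approve emergency access")
        token = secrets.token_urlsafe(32)
        self._grants[token] = Grant(operator, self._clock() + self._ttl)
        return token

    def is_valid(self, token: str) -> bool:
        grant = self._grants.get(token)
        if grant is None:
            return False
        if self._clock() >= grant.expires_at:
            # Expiry is enforced on every check, so access lapses without
            # anyone having to remember to revoke it.
            del self._grants[token]
            return False
        return True
```

Because expiry is checked on every use, a token that is forgotten after an incident stops working on its own, which is the property the text describes as reducing accidental long-term exposure.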

Preventing Drift and Regressions
Cloudflare’s network is constantly evolving. To prevent past fixes from being undone by future changes, the team built automated regression detection tools. These tools continuously compare the current configuration against known safe baselines. If a deviation is detected that matches a previously resolved incident pattern, the tools alert the operations team instantly. This ensures that hard-earned reliability gains are not lost over time.
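A baseline comparison of this kind might look like the sketch below. It is an illustration only: the flat key/value configuration model, the `IncidentPattern` type, and the example key and incident identifier in the usage note are hypothetical, and the real tools are not described in detail here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Mapping, Sequence, Tuple


@dataclass(frozen=True)
class IncidentPattern:
    incident_id: str                     # hypothetical identifier of a past incident
    key: str                             # configuration key that incident involved
    is_regressed: Callable[[Any], bool]  # True if the current value reintroduces it


def diff_config(baseline: Mapping[str, Any],
                current: Mapping[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (baseline_value, current_value)} for every differing key.

    A key that is absent on one side is reported with None on that side.
    """
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(baseline) | set(current)):
        old, new = baseline.get(key), current.get(key)
        if old != new:
            drift[key] = (old, new)
    return drift


def find_regressions(baseline: Mapping[str, Any],
                     current: Mapping[str, Any],
                     patterns: Sequence[IncidentPattern]) -> List[str]:
    """Return the ids of past incidents whose pattern matches the current drift."""
    drift = diff_config(baseline, current)
    return [
        p.incident_id
        for p in patterns
        if p.key in drift and p.is_regressed(drift[p.key][1])
    ]
```

For example, with a pattern such as `IncidentPattern("INC-0001", "max_rule_count", lambda v: v is None or v <= 0)`, a change that removes or zeroes the limit would be flagged, while unrelated drift on other keys would be reported by `diff_config` but not raised as a regression.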
Transparent Customer Communication During Outages
When things go wrong, clear communication is critical. Cloudflare revised its outage notification templates and internal escalation paths to provide more timely, accurate status updates. The goal is to share what is known, what is being done, and an estimated resolution timeframe—while being honest about uncertainties. This process was stress-tested during recent drills and will continue to be refined.
Snapstone in Action: How Health-Mediated Deployment Works
To understand Snapstone’s value, consider a typical configuration change. Before deployment, the change is packaged with its expected health metrics. The package is then released to a small subset of the network. If health indicators (e.g., error rates, latency, CPU usage) remain within acceptable thresholds, the rollout expands. If any metric breaches a threshold, the package is automatically rolled back and the team is notified. This process is transparent to the affected product teams, who now get safety by default without extra effort.
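The rollout loop described above can be sketched in Python. This is a minimal illustration of health-gated staged deployment in general, not Snapstone's actual implementation: the stage names, the `apply`, `revert`, and `read_health` callbacks, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class RolloutResult:
    completed: bool                    # True if every stage passed its health check
    deployed_stages: List[str] = field(default_factory=list)  # stages still running the change
    breached_metric: str = ""          # metric that triggered rollback, if any


def first_breach(metrics: Dict[str, float], thresholds: Dict[str, float]) -> str:
    """Return the name of the first metric above its threshold, or "" if none.

    A metric that is missing from the reading counts as a breach, so a blind
    spot in monitoring halts the rollout instead of letting it proceed.
    """
    for name, limit in thresholds.items():
        if metrics.get(name, float("inf")) > limit:
            return name
    return ""


def staged_rollout(stages: Sequence[str],
                   thresholds: Dict[str, float],
                   apply: Callable[[str], None],
                   revert: Callable[[str], None],
                   read_health: Callable[[str], Dict[str, float]]) -> RolloutResult:
    """Apply a change stage by stage, reverting everything on the first breach."""
    deployed: List[str] = []
    for stage in stages:
        apply(stage)
        deployed.append(stage)
        breached = first_breach(read_health(stage), thresholds)
        if breached:
            # Roll back in reverse order so the widest stage is undone first.
            for done in reversed(deployed):
                revert(done)
            return RolloutResult(False, [], breached)
    return RolloutResult(True, deployed)
```

A caller would pass stages ordered from smallest to largest (for instance `["canary", "region", "global"]`) together with limits such as `{"error_rate": 0.05, "p99_latency_ms": 250.0}`. In practice each stage would also wait for metrics to settle before the health check; that delay is omitted here to keep the sketch short.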
What This Means for Cloudflare Customers
The immediate benefit is reduced risk of widespread outages caused by configuration errors. For most customers, internal changes now take effect gradually, with automated safeguards that detect and revert problems before they affect traffic. Overall, the network is more resilient to unexpected failures, and the team is better equipped to respond when they do occur. While no system can be perfect, Cloudflare’s “Fail Small” initiative represents a significant step forward in keeping the internet running smoothly.
For more details on specific technical implementations, refer to the sections above on safer configuration changes and Snapstone’s health-mediated deployment.