Cloudflare Outage: How a ClickHouse Permission Change Triggered a Global Bot Management Failure

CyberSecureFox 🦊

On 18 November 2025, Cloudflare, one of the world’s largest CDN and network security providers, experienced one of its most severe outages in recent years. The disruption affected Cloudflare’s global edge network and caused widespread downtime for websites and online services across multiple regions. According to Cloudflare CEO Matthew Prince, the root cause was not a cyberattack, but an internal configuration error linked to a database permission change.

From Suspected DDoS Attack to Internal Configuration Error

The first signs of trouble appeared around 11:20 UTC, when users began reporting intermittent availability issues on services fronted by Cloudflare’s reverse proxy. The pattern of failures — periods of normal operation followed by renewed instability — initially resembled a large-scale DDoS attack and automated mitigation cycles.

Further investigation, however, showed that the issue originated inside Cloudflare’s own infrastructure. During an update to access permissions in a ClickHouse cluster — a high‑performance, distributed analytical database — an error was introduced into the query used to build a critical configuration artifact for Cloudflare’s Bot Management system.

Why ClickHouse and Bot Management Are So Critical to Cloudflare

Cloudflare relies heavily on data analytics to distinguish legitimate user traffic from automated bots and malicious requests. A dedicated ClickHouse cluster generates a so‑called feature file — a structured dataset containing behavioral signals and characteristics used to classify and score bots across the global network.

This feature file is distributed to Cloudflare’s proxy servers worldwide and directly influences how requests are routed, allowed, challenged, or blocked. In practice, it functions as a central configuration element for the Bot Management pipeline, making its correctness and size constraints critical to the stability of the entire platform.
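To make this mechanism concrete, the sketch below shows in deliberately simplified terms what such a feature file might contain and how a proxy could use it to score an incoming request. The field names, weights, and scoring rule are assumptions made for illustration; Cloudflare's actual format and model are internal and not public.

```python
# A simplified, hypothetical sketch of how an edge proxy might consume a Bot
# Management "feature file". Field names, weights, and the scoring rule are
# illustrative only; Cloudflare's real format and model are not public.

FEATURE_FILE = [  # what a freshly generated feature file might contain
    {"name": "ja3_rarity",           "type": "float", "weight": 0.6},
    {"name": "header_order_anomaly", "type": "float", "weight": 0.3},
    {"name": "ip_reputation",        "type": "float", "weight": 0.1},
]

def score_request(signals: dict[str, float]) -> float:
    """Combine per-request signals with feature weights into a bot score."""
    weights = {f["name"]: f["weight"] for f in FEATURE_FILE}
    return sum(weights.get(name, 0.0) * value for name, value in signals.items())

# A request whose TLS fingerprint is rare and whose header order looks odd
# ends up with a high score and could be challenged or blocked.
print(score_request({"ja3_rarity": 0.9, "header_order_anomaly": 1.0}))  # 0.84
```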

How a Single Query Change Doubled File Size and Crashed Proxies

The goal of the ClickHouse permission change was to expose additional low‑level metrics and metadata to internal users. However, the query used to retrieve these fields was constructed incorrectly and returned far more data than intended. As a result, the generated feature file more than doubled in size.
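One plausible way a metadata query can go wrong after a permissions change is sketched below: once an additional internal database becomes visible to the account running the query, a lookup that does not filter on the database name suddenly returns every column twice. The schema, table, and database names are illustrative assumptions, not Cloudflare's actual setup.

```python
# A hypothetical reconstruction of how a permissions change can silently
# inflate the output of a metadata query. All names below are invented for
# illustration; they are not taken from Cloudflare's systems.

# Before the change, the account could only see one database, so a query like
# this returned one row per feature column:
QUERY = "SELECT name, type FROM system.columns WHERE table = 'bot_features'"

# After the change, the same account can also see an underlying internal
# database, so every column now appears twice in the metadata.
visible_metadata = [
    {"database": "default", "table": "bot_features", "name": "ja3_rarity"},
    {"database": "default", "table": "bot_features", "name": "ip_reputation"},
    {"database": "r0",      "table": "bot_features", "name": "ja3_rarity"},
    {"database": "r0",      "table": "bot_features", "name": "ip_reputation"},
]

rows = [r for r in visible_metadata if r["table"] == "bot_features"]
print(len(rows))  # 4 -> the generated feature file roughly doubles in size

# Constraining the query to an explicit database restores the expected output:
rows = [r for r in visible_metadata
        if r["table"] == "bot_features" and r["database"] == "default"]
print(len(rows))  # 2
```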

Cloudflare enforces a strict upper limit on the size of this file to protect proxy processes. When the new, oversized file exceeded that limit, edge proxies rejected it as invalid, and the error-handling path brought the proxy process down rather than degrading gracefully. Because the ClickHouse cluster produced a new version of the feature file every five minutes, the system kept reintroducing the problematic artifact, driving repeated waves of failures.
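The sketch below contrasts, in simplified Python, a loader that treats an oversized artifact as a fatal error with one that rejects it but keeps serving the last known-good version. The limit value, data structures, and function names are assumptions for illustration, not Cloudflare's actual proxy code.

```python
# A minimal sketch of the difference between a hard limit that crashes the
# consumer and one that falls back to the last known-good artifact. The
# limit and loading logic are assumptions for illustration only.

MAX_FEATURES = 200               # hypothetical upper bound on feature count
last_good_features: list[dict] = []

def load_strict(new_features: list[dict]) -> list[dict]:
    """Treats an oversized file as an unrecoverable error (crash-loop risk)."""
    if len(new_features) > MAX_FEATURES:
        raise RuntimeError("feature file exceeds limit")  # process dies here
    return new_features

def load_with_fallback(new_features: list[dict]) -> list[dict]:
    """Rejects the bad artifact but keeps serving the previous version."""
    if len(new_features) > MAX_FEATURES:
        # log, alert, and keep the last configuration that passed validation
        return last_good_features
    return new_features

oversized = [{"name": f"f{i}"} for i in range(MAX_FEATURES + 1)]
print(load_with_fallback(oversized))  # keeps serving the old configuration
```

The second pattern trades configuration freshness for availability, which is usually the right call for a control-plane artifact that is regenerated every few minutes anyway.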

Flapping and the Illusion of an Active Attack

The situation was further complicated by the fact that not all ClickHouse nodes received the updated permissions at the same time. Only a subset of nodes initially produced the malformed feature file. As a consequence, some proxies fetched a valid configuration, while others pulled the corrupted version.

This led to a classic “flapping” pattern: parts of the network alternated between healthy and degraded states depending on which feature file version they consumed. According to Prince, this oscillation mimicked the behavior typically seen during sophisticated attacks, reinforcing the initial suspicion of an external threat. By around 13:00 UTC, all ClickHouse nodes were generating the bad file, pushing the platform into a “stable failure state” with widespread, persistent impact.

Scope of Impact: European Data Centers and Global Online Services

Monitoring data and media reports indicate that Cloudflare data centers across Europe were heavily affected, including Amsterdam, Berlin, Frankfurt, Warsaw, Vienna, Zurich, Stockholm, and other locations. Outage tracking platforms such as Downdetector recorded tens of thousands of user complaints about unreachable websites and hosting platforms.

Users reported disruptions in major online services during the incident, including Spotify, Twitter, OpenAI, Anthropic, AWS, Google, and many others, a number of which rely on Cloudflare for content delivery and security. The incident once again highlighted how a small number of large CDN and security providers have become systemically important infrastructure for the modern internet.

Incident Recovery and Cloudflare’s Technical Remediation Measures

To stop the cascading failures, Cloudflare engineers halted generation of the faulty feature files, manually enqueued a known‑good configuration version, and forced a restart of the primary proxy components. Full restoration of services took roughly six hours, with normal operation reportedly re‑established by about 17:44 UTC.

Cloudflare described the event as its most significant outage since 2019 and publicly apologized to customers. The company outlined a set of corrective actions, including stronger configuration validation for critical files, additional kill switch mechanisms to rapidly disable problematic features, and a review of error‑handling logic across core proxy modules to avoid crash‑loop behavior.
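A kill switch in this context is essentially a centrally controlled flag that lets operators bypass a misbehaving feature pipeline without redeploying software. The minimal sketch below illustrates the general pattern; the flag name, environment-variable mechanism, and decision logic are hypothetical.

```python
# A hedged sketch of a kill-switch check: one centrally controlled flag that
# lets operators skip a feature pipeline entirely. The flag name and the
# scoring threshold are hypothetical.

import os

def bot_management_enabled() -> bool:
    """Read the kill switch; default to enabled when the flag is absent."""
    return os.environ.get("BOT_MANAGEMENT_KILL_SWITCH", "off") != "on"

def handle_request(request: dict, score_fn) -> str:
    if not bot_management_enabled():
        # Fail open: skip bot scoring entirely rather than risk a crash loop.
        return "allow"
    return "challenge" if score_fn(request) > 0.8 else "allow"

print(handle_request({"ja3_rarity": 0.95}, lambda r: r.get("ja3_rarity", 0.0)))
```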

Key Lessons for Security and Infrastructure Teams

This Cloudflare outage illustrates several hard‑won lessons for cybersecurity and SRE teams. First, configuration and data artifacts should be subject to automated validation — including schema checks, size limits, and semantic tests — before they are distributed to production systems. Second, permission and schema changes to critical databases such as ClickHouse should follow staged or canary deployments rather than broad, simultaneous rollout.
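As an illustration of the first point, the sketch below shows what a pre-publication validation gate for a configuration artifact might look like: a size ceiling, a schema check, and a simple semantic test for duplicate entries. The thresholds and field names are assumptions and do not describe Cloudflare's actual pipeline.

```python
# A minimal sketch of an automated validation gate for a configuration
# artifact, run before it is published to production. The checks and
# thresholds are illustrative assumptions.

MAX_BYTES = 5 * 1024 * 1024          # hypothetical size ceiling
REQUIRED_KEYS = {"name", "type", "weight"}

def validate_artifact(features: list[dict], raw_size: int) -> list[str]:
    """Return validation errors; an empty list means 'safe to publish'."""
    errors = []
    if raw_size > MAX_BYTES:                              # size limit
        errors.append(f"artifact is {raw_size} bytes, limit is {MAX_BYTES}")
    for i, feat in enumerate(features):                   # schema check
        missing = REQUIRED_KEYS - feat.keys()
        if missing:
            errors.append(f"entry {i} missing fields: {sorted(missing)}")
    names = [f.get("name") for f in features]             # semantic check
    if len(names) != len(set(names)):
        errors.append("duplicate feature names detected")
    return errors

# Only distribute the artifact when every check passes; otherwise keep the
# previous version and alert the owning team.
```

Running such a gate in the generation pipeline, rather than only in the consumer, stops a bad artifact before it ever reaches the edge.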

Equally important are robust observability tools that can quickly distinguish between internal misconfigurations and external attacks. High‑quality logs, metrics, and tracing reduce diagnostic time and allow incident responders to move from “assume attack” to “verify root cause” much faster. Comparisons can be drawn with previous large‑scale outages at other providers, where single configuration errors in CDNs or DNS platforms cascaded into global disruptions.

For organizations heavily dependent on cloud and CDN providers, this incident is a reminder to review internal processes: how access rights to critical data stores are managed, what automated safeguards exist for configuration artifacts, and how quickly a bad change can be rolled back globally. Building multi‑layered resilience, practicing incident response, and treating configuration as code — with testing, review, and rollbacks — significantly reduces the risk that a single mistake in a feature file will escalate into a worldwide outage.
