5 Critical Lessons from a ClickHouse Billing Slowdown: How We Uncovered a Hidden Bottleneck

At Cloudflare, our billing pipeline depends on millions of daily ClickHouse queries to calculate usage-based charges. When those aggregation jobs suddenly slowed to a crawl after a routine migration, the entire revenue reconciliation process was thrown into jeopardy. Standard diagnostics showed no obvious issues: I/O was normal, memory usage fine, rows scanned and parts read within expected ranges. Yet the pipeline kept lagging. This article reveals the hidden bottleneck we discovered deep inside ClickHouse's internals, the three patches we developed, and the key lessons any ClickHouse operator can apply to avoid similar surprises.

1. The Petabyte-Scale Platform That Made Per-Namespace Retention Essential

Cloudflare built "Ready-Analytics" in early 2022 to simplify onboarding for internal teams. Instead of designing custom tables, teams stream data into one massive table disambiguated by a namespace. Each record follows a standard schema with 20 float fields, 20 string fields, a timestamp, and an indexID. The primary key is (namespace, indexID, timestamp), which allows each namespace's data to be sorted optimally for its queries. By December 2024, the system held over 2 PiB of data and ingested millions of rows per second. This architecture was popular—hundreds of applications used it—but imposed a glaring limitation: a single 31-day retention policy applied to all namespaces. Teams needing longer or shorter retention had to opt for a more complex conventional setup, which prevented them from using Ready-Analytics.

2. The One-Size-Fits-All Retention Policy That Created a Bottleneck

Cloudflare used ClickHouse long before native TTL features existed, so they built a custom retention system based on table partitioning. The Ready-Analytics table was partitioned by day, and a scheduled job simply dropped partitions older than 31 days. While this worked for some teams, it was a major constraint for others. Legal or contractual obligations sometimes required data retention for years, while other teams needed only a few days. The rigid 31-day window forced those use cases to abandon Ready-Analytics. The need for per-namespace retention became urgent. However, implementing such a system introduced a hidden performance hit—especially when combined with the way ClickHouse merges parts and handles primary keys.
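As a rough illustration of such a partition-drop retention job (names are hypothetical and the scheduling wrapper is omitted), the logic boils down to finding daily partitions past the 31-day cutoff and dropping them outright:

```sql
-- Find daily partitions older than the 31-day window
-- (partition values are date strings when partitioning by toDate(timestamp)).
SELECT DISTINCT partition
FROM system.parts
WHERE database = 'ready_analytics'
  AND table = 'events'
  AND active
  AND partition < toString(today() - 31);

-- For each partition returned, dropping it is a cheap metadata-level operation:
ALTER TABLE ready_analytics.events DROP PARTITION '2024-11-01';
```

Dropping a whole partition is nearly free, which is exactly why moving to per-namespace retention, where expiry no longer falls on partition boundaries, turned out to be so disruptive.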

3. The Mysterious Slowdown: When Everything Looked Normal

Following the migration, daily aggregation jobs that usually finished in minutes started taking hours. The first response was to check the usual suspects: I/O wait times were low, memory pressure was absent, and rows scanned and parts read matched historical patterns. The query profiles showed nothing unusual. The team began to suspect that the problem was not in the queries themselves but deeper in ClickHouse's internal processes, so they examined merge operations, data part organization, and index usage. After extensive profiling, they discovered that the bottleneck was not in scanning old data or writing new data but in a specific phase of the merge process, one triggered by the mismatch between the primary key order and the per-namespace retention scheme.
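The article does not list the exact queries used, but when the standard metrics look clean, the next place to look is usually ClickHouse's own introspection tables. The queries below are generic examples of that kind of digging, not the team's actual workflow; the table name is illustrative.

```sql
-- Merges currently in flight: long elapsed times with little progress
-- suggest that merge work, not query execution, is the bottleneck.
SELECT table, elapsed, progress, num_parts, rows_read, memory_usage
FROM system.merges
ORDER BY elapsed DESC;

-- Recent merge history: merges that read far more rows than they produced
-- point at cycles spent on data that is later thrown away.
SELECT event_time, table, duration_ms, rows, read_rows, peak_memory_usage
FROM system.part_log
WHERE event_type = 'MergeParts'
  AND event_date >= today() - 1
ORDER BY duration_ms DESC
LIMIT 20;
```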

4. Digging Deeper: The Hidden Bottleneck Inside ClickHouse Internals

The root cause lay in how ClickHouse handles merges when the primary key includes a namespace string that is not aligned with the partition key. With per-namespace retention, some partitions had to be partially dropped: only the rows belonging to expired namespaces could be removed, while the rest of the partition had to be kept. This forced ClickHouse to perform "partial merges" that scanned all parts to determine which rows to keep. The memory and CPU overhead grew linearly with the number of distinct namespaces, which was in the hundreds. Furthermore, the indexID field, a string in the primary key, was not optimal for range filtering during these merges. The team traced the issue to three specific areas: inefficient predicate pushdown, suboptimal part pruning during merges, and a missing optimization for filtering by namespace when retention boundaries did not align with partition boundaries.
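To make the mismatch concrete: expiring a whole day is a metadata-only partition drop, but expiring a single namespace inside a partition that must otherwise survive means reading and rewriting data. A hypothetical example of both operations (table and namespace names are illustrative):

```sql
-- Whole-partition expiry: cheap, no data is read or rewritten.
ALTER TABLE ready_analytics.events DROP PARTITION '2024-11-15';

-- Per-namespace expiry inside a partition that must be kept: a mutation that
-- has to read the affected parts and write them back without the expired
-- rows; this is the "partial merge" work described above.
ALTER TABLE ready_analytics.events
    DELETE IN PARTITION '2024-11-15'
    WHERE namespace = 'expired_team';
```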

5. The Three Patches That Fixed It—and What They Teach Us

We wrote and deployed three patches to resolve the bottleneck. First, we optimized predicate pushdown so that namespace-based filters are applied earlier in the merge pipeline, reducing the amount of data scanned. Second, we improved part pruning to skip entire parts when the namespace filter is known to be exclusive. Third, we introduced a smarter merge scheduler that recognizes partial retention merges and allocates resources more efficiently. After these patches, query times returned to normal—and in some cases improved—because the system no longer wasted cycles scanning data that would be discarded. The key lessons for any ClickHouse operator: (1) align partition and primary key design with retention policies, (2) monitor merge behavior especially when using string-based keys, and (3) don't assume that standard metrics always capture the real bottleneck.
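The source does not publish the final schema, but one generic way to apply lesson (1), offered here purely as a sketch, is to fold a retention tier into the partition key so that expiry always falls on partition boundaries and stays a cheap whole-partition drop:

```sql
-- Hypothetical layout: namespaces are grouped into a small set of retention
-- tiers, and the tier is part of the partition key. Expiring data for a tier
-- is then a plain DROP PARTITION rather than a partial merge.
CREATE TABLE ready_analytics.events_tiered
(
    retention_tier LowCardinality(String),  -- e.g. '7d', '31d', '1y'
    namespace      LowCardinality(String),
    indexID        String,
    timestamp      DateTime
    -- ... payload columns as before ...
)
ENGINE = MergeTree
PARTITION BY (retention_tier, toDate(timestamp))
ORDER BY (namespace, indexID, timestamp);
```

The trade-off is a larger number of partitions and the need to keep the namespace-to-tier mapping stable, which is why this is a design sketch rather than a drop-in fix.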

Conclusion: The billing pipeline slowdown taught us that even in a well-tuned ClickHouse deployment, hidden inefficiencies can lurk in the interaction between data organization and internal processes. By understanding how merges, primary keys, and retention interact, you can prevent similar surprises. The three patches we contributed to the open-source ClickHouse project now help the entire community avoid this class of performance trap.
