Understanding Cloudflare’s DNS Query Analytics
How Cloudflare analyzes 1M DNS queries per second
I enjoy reading write-ups of real technical scenarios; they make it easier for me to visualize large systems, and the insights I pick up often help me solve problems at work faster and think differently when making decisions. I usually jot my findings in a notebook, but this time I wanted to write them up in a longer format, since writing things out helps me learn better.
For those who do not plan to read the entire original blog:
Cloudflare is a major Internet service provider that offers essential DNS services to the digital world. In a blog post from late 2017, they reported handling 1 million DNS queries per second, a figure that has likely been exceeded since. They log these DNS queries and have built a system that can analyze 1 million DNS query log entries per second.
If you’re interested, you can read the blog post introducing the analytics dashboard built on top of this system: Meet The Brand New DNS Analytics Dashboard
They offer DNS service for free on a limited basis, which is great for personal websites.
Here are my key takeaways from the article, organized by chapter:

How Logs Come In from the Edge
- Logs are serialized with Cap’n Proto, which keeps encoding and decoding cheap.
- Added to the reading list: Cap’n Proto: Introduction
- Apache Kafka plays a crucial role in managing the data flow, solving problems like buffering and decoupling producers from consumers out of the box.
- Previously, data was analyzed using Parquet files extracted from databases.
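The edge pipeline above (serialized logs batched and shipped through Kafka) can be sketched with a simple batcher. Everything here is illustrative: the `DNSLog` fields, the JSON encoding (Cloudflare's real format is Cap'n Proto), and the batch size are assumptions, not their actual schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DNSLog is a hypothetical, simplified log record; Cloudflare's real
// schema is defined in Cap'n Proto, not JSON.
type DNSLog struct {
	Timestamp int64  `json:"ts"`
	QueryName string `json:"qname"`
	QueryType string `json:"qtype"`
	ColoID    int    `json:"colo"`
}

// Batcher accumulates encoded records and flushes them in groups,
// the way an edge node would hand batches to a Kafka producer.
type Batcher struct {
	batchSize int
	pending   [][]byte
	flushed   [][][]byte // stands in for "batches sent to Kafka"
}

func (b *Batcher) Add(rec DNSLog) error {
	buf, err := json.Marshal(rec)
	if err != nil {
		return err
	}
	b.pending = append(b.pending, buf)
	if len(b.pending) >= b.batchSize {
		b.Flush()
	}
	return nil
}

// Flush moves the pending records into a completed batch.
func (b *Batcher) Flush() {
	if len(b.pending) == 0 {
		return
	}
	b.flushed = append(b.flushed, b.pending)
	b.pending = nil
}

func main() {
	b := &Batcher{batchSize: 2}
	for i := 0; i < 5; i++ {
		b.Add(DNSLog{Timestamp: int64(i), QueryName: "example.com.", QueryType: "A", ColoID: 42})
	}
	b.Flush()
	fmt.Println(len(b.flushed)) // 3 batches: 2 + 2 + 1
}
```

Batching like this amortizes the per-message overhead of the transport, which matters when every edge node emits a constant stream of small records.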
About Aggregations
- The Parquet data that Apache Spark reads has no indexing, which makes it unsuitable for online use: every query has to scan the entire table.
- Added to the reading list: Explore Parquet Index on GitHub — parquet-index
- Shifting to OLAP systems like Druid (which handles 100B events/day) and Yandex ClickHouse makes handling massive daily event volumes practical.
And Then It Clicked
- Added to the reading list: ZooKeeper — Apache ZooKeeper is an effort to develop and maintain an open-source server that enables highly reliable distributed coordination.
- They needed low-level implementations in Go, configuring tools to meet their specific requirements rather than relying solely on pre-built solutions.
- Added to the reading list: MergeTree — the table engine family ClickHouse is built on.
- Added to the reading list: ClickHouse Primary Keys — ClickHouse creates sparse index files to avoid a full scan.
- Sorting and indexing the data according to how it is queried yields sequential reads, which improves performance.
- Rather than optimizing one index for several different query patterns, it is better to create a separate table per pattern.
Infrastructure and Data Integrity
- They started with RAID-10 and then switched to RAID-0: hot-swapping and rebuilding a lost disk was not feasible while the system was under heavy load. Instead, they rely on 3-way replication across machines.
- Even the choice of filesystem has a measurable impact on the system.
Visualizing Data
- Added to the reading list: Apache Superset — Superset is a modern data exploration and data visualization platform.

In conclusion, Cloudflare’s DNS Query Analytics showcases a robust system capable of handling immense data volumes: Cap’n Proto keeps serialization cheap, Apache Kafka manages the data flow, and OLAP systems like Druid and Yandex ClickHouse make massive daily event volumes queryable. The low-level Go implementations and the deliberate sorting and indexing strategy show a commitment to tailoring tools for performance rather than relying only on defaults. Infrastructure choices, from RAID configuration and 3-way replication to filesystem selection, reflect the same care for data integrity and reliability. Finally, visualizing the data through tools like Apache Superset turns all of this into insights and better decisions.