Cloudflare accuses Perplexity of content scraping; Perplexity fires back at “publicity stunt” claims

Cloudflare has accused Perplexity, an AI-powered search and answer platform, of using hidden and undeclared web crawlers to access website content that was explicitly blocked via site rules. According to Cloudflare, this behaviour bypasses standard protocols like robots.txt and undermines trust between AI services and web publishers.

Cloudflare’s investigation
Cloudflare states that it observed Perplexity evading both robots.txt instructions and network firewall rules by using undeclared bots that masqueraded as regular web browsers. For its tests, Cloudflare says it registered new domains that had not been publicised or indexed anywhere, blocked Perplexity's known bots, and disallowed all crawling in robots.txt.

Despite these measures, Perplexity’s platform was able to retrieve and summarise content from those hidden websites. Cloudflare attributes this to stealth crawlers that presented a generic browser user agent (such as Chrome on macOS) and rotated through unlisted IP addresses and network providers to avoid detection.

What is robots.txt?
The robots.txt file is a publicly accessible text file placed in a website’s root directory. It acts as a guide for automated bots or web crawlers, telling them which parts of a website they may and may not access. Well-behaved bots, such as those from search engines like Google, typically respect these instructions and avoid crawling disallowed paths.

However, these rules are voluntary and rely on the crawler’s willingness to comply. If a bot is designed to ignore robots.txt, there is no built-in mechanism to stop it—unless network-level blocks are also enforced.
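
To make the voluntary nature of these rules concrete, here is a minimal Python sketch of the check a compliant crawler might run before fetching a page. The robots.txt contents, the bot name `ExampleBot`, and the `example.com` URLs are hypothetical and used only for illustration; the parsing is done by Python's standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; not taken from any real site.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("ExampleBot", "https://example.com/news"),
        ("OtherBot", "https://example.com/news"),
        ("OtherBot", "https://example.com/private/report"),
    ]
    for agent, url in checks:
        verdict = "allowed" if is_allowed(agent, url) else "disallowed"
        print(f"{agent} -> {url}: {verdict}")
```

Nothing in this check stops a crawler from skipping it and requesting the page anyway, which is why robots.txt works only for bots that choose to honour it.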

What is AI content scraping?
AI content scraping refers to the process of automatically extracting content from websites to feed into AI models or generate responses. This can include text, images, or structured data. Some platforms use this content to train large language models, while others use it in real time to respond to user queries with direct answers instead of links.

While search engines and aggregators have long used content scraping, the rise of generative AI tools has raised new concerns. Unlike traditional search, which directs users to the source website, AI responses often summarise or paraphrase content without requiring the user to visit the original site—potentially reducing web traffic and undermining publishers.

Cloudflare’s response
Following its findings, Cloudflare has removed Perplexity from its list of verified bots and deployed new security rules to detect and block the company’s stealth crawlers. These rules are available to all Cloudflare customers, including those using its free tier.

Cloudflare emphasised that trustworthy crawlers should identify themselves, respect site directives, and use disclosed IP ranges. It cited OpenAI’s ChatGPT as an example of a platform that respects robots.txt and ceases crawling when blocked.
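
The idea of a crawler that identifies itself and uses disclosed IP ranges can be illustrated with a short Python sketch. This is not Cloudflare's detection logic, which the article does not describe; the crawler name `ExampleBot`, the `DECLARED_CRAWLERS` table, and the IP ranges (reserved documentation addresses) are all assumptions made for the example.

```python
import ipaddress

# Hypothetical published ranges for a hypothetical crawler. These are
# reserved documentation addresses, not real crawler IPs.
DECLARED_CRAWLERS = {
    "ExampleBot": ["192.0.2.0/24", "198.51.100.0/24"],
}


def classify_request(user_agent: str, remote_ip: str) -> str:
    """Classify a request as 'verified', 'spoofed', or 'undeclared'.

    verified   - names a known crawler and comes from its published ranges
    spoofed    - names a known crawler but comes from some other address
    undeclared - does not name any known crawler
    """
    ip = ipaddress.ip_address(remote_ip)
    for name, cidrs in DECLARED_CRAWLERS.items():
        if name.lower() in user_agent.lower():
            in_range = any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
            return "verified" if in_range else "spoofed"
    return "undeclared"


if __name__ == "__main__":
    bot_ua = "Mozilla/5.0 (compatible; ExampleBot/1.0)"
    browser_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/124.0"
    print(classify_request(bot_ua, "192.0.2.10"))
    print(classify_request(bot_ua, "203.0.113.5"))
    print(classify_request(browser_ua, "203.0.113.5"))
```

A request classed as "undeclared" may be a person using a browser or a bot presenting a browser identity; this check alone cannot tell the two apart, which is the gap the alleged stealth crawlers would fall into.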

Perplexity calls it a publicity stunt
Perplexity has denied wrongdoing in response to Cloudflare’s accusations and its new rules against what Cloudflare calls Perplexity’s “stealth crawling.” Speaking to The Verge, spokesperson Jesse Dwyer dismissed Cloudflare’s claims as a “publicity stunt,” saying, “there are a lot of misunderstandings in the blog post.” The disagreement centres on how Perplexity gathers information for its AI search, especially through its Perplexity-User bot, which the company admits “generally ignores robots.txt”, the file that sites use to tell crawlers which pages are off limits.

What Perplexity’s documentation says about web crawling
Perplexity lists two bots in its documentation: PerplexityBot, used for indexing content for search, and Perplexity-User, which fetches live content in response to user queries. While PerplexityBot respects robots.txt directives, Perplexity-User “generally ignores” them, meaning it may access websites even if they explicitly disallow bots. The company says neither crawler is used to train AI models and provides IP ranges for both bots. It encourages publishers to allow PerplexityBot to improve visibility in AI search. However, the crawler behaviour—especially of Perplexity-User—has raised questions about compliance with site policies and the boundaries of AI content sourcing.

Why it matters
As AI-driven platforms rely more on real-time web data, tensions between content creators and tech companies are intensifying. Cloudflare’s report adds to the growing debate around AI data access, content control, and the need for clearer governance over how automated systems interact with the open web.
