
Cloudflare accuses Perplexity of content scraping; Perplexity fires back at “publicity stunt” claims

According to Cloudflare, this behaviour bypasses standard protocols like robots.txt and undermines trust between AI services and web publishers.

August 05, 2025 / 10:53 IST

Cloudflare has accused Perplexity, an AI-powered search and answer platform, of using hidden and undeclared web crawlers to access website content that was explicitly blocked via site rules. According to Cloudflare, this behaviour bypasses standard protocols like robots.txt and undermines trust between AI services and web publishers.

Cloudflare’s investigation

Cloudflare states that it observed Perplexity evading both robots.txt instructions and network firewalls by using undeclared bots that masqueraded as regular web browsers. In its tests, Cloudflare registered new domains that were not publicly discoverable, blocked Perplexity’s known bots on them, and explicitly disallowed all crawling in robots.txt.

Despite these measures, Perplexity’s platform was able to retrieve and summarise content from those hidden websites. Cloudflare says this was made possible by stealth crawlers using generic browser identities (like Chrome on macOS) and rotating through unlisted IP addresses and network providers to avoid detection.
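To illustrate why a generic browser identity can slip past a block aimed at declared bots, here is a minimal sketch of a User-Agent filter. It is not Cloudflare’s detection logic; the function name, the blocked tokens and both header strings are illustrative assumptions rather than captured traffic.

```python
# Hypothetical list of User-Agent tokens a site operator chooses to block.
BLOCKED_TOKENS = ("perplexitybot", "perplexity-user")


def is_blocked_by_user_agent(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


# Illustrative header values (assumptions, not real request logs).
declared_bot = "Mozilla/5.0 (compatible; PerplexityBot/1.0)"
generic_browser = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

print(is_blocked_by_user_agent(declared_bot))     # True
print(is_blocked_by_user_agent(generic_browser))  # False
```

A filter like this only catches crawlers that announce themselves, which is why a request presenting itself as Chrome on macOS would pass unless other signals are checked as well.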

What is robots.txt?

The robots.txt file is a publicly accessible text file placed in a website’s root directory. It acts as a guide for automated bots and web crawlers, telling them which parts of a website they are allowed, or not allowed, to access. Well-behaved bots, such as those from search engines like Google, typically respect these instructions and avoid crawling disallowed paths.

However, these rules are voluntary and rely on the crawler’s willingness to comply. If a bot is designed to ignore robots.txt, there is no built-in mechanism to stop it—unless network-level blocks are also enforced.
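As a rough illustration of how a compliant crawler consults these rules, the sketch below uses Python’s standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs and the `ExampleBot` name are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is blocked everywhere,
# every other bot is blocked only from /private/.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a queryable rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved crawler asks before fetching each URL.
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))      # False
print(parser.can_fetch("ExampleBot", "https://example.com/article"))         # True
print(parser.can_fetch("ExampleBot", "https://example.com/private/report"))  # False
```

Nothing forces a crawler to make this check. A bot that never calls something like `can_fetch` simply requests the page, which is why the file works as a request rather than a barrier.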

What is AI content scraping?

AI content scraping refers to the process of automatically extracting content from websites to feed into AI models or generate responses. This can include text, images, or structured data. Some platforms use this content to train large language models, while others use it in real time to respond to user queries with direct answers instead of links.

While search engines and aggregators have long used content scraping, the rise of generative AI tools has raised new concerns. Unlike traditional search, which directs users to the source website, AI responses often summarise or paraphrase content without requiring the user to visit the original site—potentially reducing web traffic and undermining publishers.

Cloudflare’s response

Following its findings, Cloudflare has removed Perplexity from its list of verified bots and deployed new security rules to detect and block the company’s stealth crawlers. These rules are available to all Cloudflare customers, including those using its free tier.

Cloudflare emphasised that trustworthy crawlers should identify themselves, respect site directives, and use disclosed IP ranges. It cited OpenAI’s ChatGPT as an example of a platform that respects robots.txt and ceases crawling when blocked.
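One way a site could check the “disclosed IP ranges” criterion is sketched below with Python’s standard `ipaddress` module. This is an assumption-laden example, not Cloudflare’s verification process: the ranges are reserved documentation blocks standing in for an operator’s published list, and the `examplebot` token and function names are hypothetical.

```python
import ipaddress

# Placeholder ranges from reserved documentation blocks (RFC 5737),
# standing in for a crawler operator's published IP list.
DISCLOSED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def comes_from_disclosed_range(client_ip: str) -> bool:
    """Return True if the client IP falls inside any published range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in DISCLOSED_RANGES)


def classify_request(user_agent: str, client_ip: str,
                     declared_token: str = "examplebot") -> str:
    """Label a request by comparing its claimed identity with its source IP."""
    claims_identity = declared_token in user_agent.lower()
    if claims_identity and comes_from_disclosed_range(client_ip):
        return "declared crawler"
    if claims_identity:
        return "spoofed user agent"
    return "undeclared"


print(classify_request("ExampleBot/1.0", "192.0.2.15"))    # declared crawler
print(classify_request("ExampleBot/1.0", "203.0.113.9"))   # spoofed user agent
print(classify_request("Mozilla/5.0 Chrome", "192.0.2.15"))  # undeclared
```

A check like this can confirm that a bot claiming a known identity really comes from its operator, but it says nothing about traffic that hides behind a browser identity and unlisted addresses.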

Perplexity calls it a publicity stunt

In response to Cloudflare’s accusations, Perplexity denied wrongdoing. Speaking to The Verge, spokesperson Jesse Dwyer dismissed Cloudflare’s claims as a “publicity stunt,” saying, “there are a lot of misunderstandings in the blog post.” The disagreement centres on how Perplexity gathers information for its AI search, especially through its Perplexity-User bot, which the company admits “generally ignores robots.txt”, the standard that tells web crawlers which pages they may access.

What Perplexity’s documentation says about web crawling

Perplexity lists two bots in its documentation: PerplexityBot, used for indexing content for search, and Perplexity-User, which fetches live content in response to user queries. While PerplexityBot respects robots.txt directives, Perplexity-User “generally ignores” them, meaning it may access websites even if they explicitly disallow bots. The company says neither crawler is used to train AI models and provides IP ranges for both bots. It encourages publishers to allow PerplexityBot to improve visibility in AI search. However, the crawler behaviour—especially of Perplexity-User—has raised questions about compliance with site policies and the boundaries of AI content sourcing.

Why it matters

As AI-driven platforms rely more on real-time web data, tensions between content creators and tech companies are intensifying. Cloudflare’s report adds to the growing debate around AI data access, content control, and the need for clearer governance over how automated systems interact with the open web.

MC Tech Desk
first published: Aug 5, 2025 10:52 am
