title | description | date
---|---|---
Scrapers I block (and allow), with explanations | Here’s my thought process when deciding whether to block a scraper from seirdy.one, the scrapers I block, the scrapers I allow, and the ways I block them. | 2024-12-10T00:37:54-05:00
Introduction
An old version of my robots.txt file was full of documentation on each user-agent I blocked. Over time, other websites started using my robots.txt file as a reference. I'm humbled by how useful others find my robots.txt file.1
Unfortunately, a plain text file in a browser window isn't the most readable experience for everybody. It has no headings for quick navigation, inconsistent support for soft line-wrapping, and poor support for hyperlinks.
I'm compiling documentation on the bots I block here, including my thought process for deciding whether to block or allow a bot. This isn't an alternative to larger resources on bad-bot lists, but a secondary resource for them to cite. If you compile a larger list of bots, feel free to cite this page.
At the same time, I hope this page offers more nuance than generic bad-bot lists do. I even go so far as explaining why I allow certain commonly-blocked bots.
Skip to the list of bots I block if that's all you're here for.
Criteria for bad-bot blocking
Bots I block fall into one of the following categories:
-
Bots that only serve to power adtech on other sites. My site has no ads, but I allow bots such as Google's AdsBot.
-
Intellectual property snitches. I forbid robots that scan for plagiarism, trademark/copyright violations, brand protection, etc.
-
Robots that power invasive background checks that border on cyberstalking.
-
Scrapers that build datasets to train Generative AI (GenAI), such as large language models (LLMs). I don't block search clients used by GenAI research assistants; I only block scrapers used to train GenAI models.
On contradictions between intellectual-property snitches and GenAI scrapers
Some people think opposing GenAI scraping is support for intellectual property, and opposing intellectual property means support for GenAI scraping.
My distaste for modern intellectual property protections comes from seeing how they stifle creativity. My view on the relationship between IP and GenAI scrapers is that much of (not all of) IP law and GenAI are bad for creators. Policies on scraping, such as parts of the EU "CDSM Directive" relevant to the Text and Data Mining Reservation Protocol (discussed later), have the potential to strike the right balance between them on the Web. I do not welcome copyright snitches that work against remix culture, and GenAI scrapers that devalue and replace creative labor (remix or not), on seirdy.one.
How I block bots
Here's how I block bots, from the most polite approaches to the most severe restrictions.
X-Robots
X-Robots
tags control what to do with a page after it has been crawled. They do not opt out of mere crawling: they opt out of processing pages after they've been discovered by a crawler.
I set X-Robots
tags in every page that forbid training Generative AI algorithms but accept just about everything else. Google Search Central lists its supported directives.
{{}}DeviantArt popularized the NoAI
X-Robots
tag in {{}}{{}}, which Cohost and Misskey have since implemented. The img2dataset scraper respects it.
In September 2024, Bing announced support for a nocache
robots directive and hijacked the existing noarchive
directive.
nocache
allows Microsoft to do LLM training only using search engine result titles and snippets, and preserves visibility in Bing Chat.
noarchive
completely opts a site out of Bing Chat and Microsoft's LLM training.
I adopted nocache
, as I still want my site to support real archiving services.
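As a rough illustration of how a well-behaved scraper could honor these directives, here is a minimal Python sketch. It assumes the directives arrive as a comma-separated `X-Robots-Tag` HTTP header value. The function names and the sample header values are hypothetical, not the exact headers seirdy.one serves.

```python
def parse_x_robots(header_value: str) -> set[str]:
    """Split an X-Robots-Tag header value into lowercase directives."""
    return {part.strip().lower() for part in header_value.split(",") if part.strip()}


def may_train_genai(header_value: str) -> bool:
    """Return False when the page opts out of GenAI training via `noai`."""
    return "noai" not in parse_x_robots(header_value)


# Hypothetical header values, for illustration only.
print(may_train_genai("noai, noimageai, nocache"))  # opted out of training
print(may_train_genai("index, follow"))             # no opt-out present
```

A scraper has to fetch the page before it can read these directives, which is why they govern processing rather than crawling.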
robots.txt
robots.txt is meant to opt out of crawling, to reduce server load. It does not opt you out of further processing of crawled pages. Data miners can still fetch your pages without crawling them: they can fetch archived snapshots, use data collection in users' browsers or browser extensions, download or buy datasets, etc. X-Robots
tags are the only standard vendor-neutral format for opting out of processing of crawled pages.
There is no way to opt out of both crawling and further processing. A page's policies on permitted forms of processing can only be found by visiting the page and reading those policies in X-Robots
tags.
That said, many robots do not respect X-Robots
tags, or an X-Robots
tag doesn't exist to prevent the type of processing I want to prevent (e.g. IP-snitching). Often, a robots.txt entry is the only documented way to opt out of something. In these cases, I compromise and hijack my robots.txt to opt out.
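To make the opt-out concrete, the sketch below feeds a two-entry robots.txt to Python's standard `urllib.robotparser` and asks whether two crawlers may fetch a page. The robots.txt text, the `example.com` URL, and the `SomeSearchBot` name are illustrative assumptions, not a copy of my actual file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: forbid one product token, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() takes the crawler's product token and the URL it wants.
print(parser.can_fetch("GPTBot", "https://example.com/posts/"))
print(parser.can_fetch("SomeSearchBot", "https://example.com/posts/"))
```

A compliant crawler performs this check before requesting a page. Nothing in the file tells it what it may do with pages it obtained some other way.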
Comparing robots.txt, X-Robots, and the TDM Reservation Protocol
Ideally, X-Robots
would support per-engine rules in a standard format, such as robots.txt. Unfortunately, it doesn't. Unlike robots.txt, it has per-page rules that persist on archives of a given page, potentially saved in other data sets.
A successor to X-Robots
called the Text and Data Mining (TDM) Reservation Protocol may close the gap between X-Robots
and robots.txt. It allows setting policies on data-mining to forbid LLM training, among other things.
The noai
X-Robots
directive makes the same mistake as the "Do Not Track" HTTP request header, in that it's a vague polite request with no incentive for compliance. Crawlers generally respect robots.txt because they get caught and banned if they don't; indexers respect noindex
because it prevents duplicate, internal, and shallow pages from getting indexed. There's no incentive to respect noai
.
The TDM Reservation Protocol is less like "Do Not Track" and more like Global Privacy Control: observance of the TDM Reservation Protocol could be a way to qualify for an exception under the European Union's "CDSM Directive". It could have a clear legal meaning and legal incentive, which gives companies a very good reason to respect it. After the TDM Reservation Protocol finalizes and is shown to work, I may adopt it.
User-Agent regex denial
I block certain bots by matching their User-Agent strings against a regular expression and serving them a "403: Forbidden" HTTP error. I don't want to welcome these robots on seirdy.one. They typically don't obey (or have a history of disobeying) X-Robots
or robots.txt. This is a last resort. I need well-behaved robots to read my X-Robots
tags and robots.txt so they can know that I'm opting out of data collection. By doing so, they can also skip my site when scraping third-party datasets and drop existing copies of my pages from their datasets.
Currently-relevant bots I block in robots.txt
Here's the list of bots, their product tokens, and information on the bots.
A product token populates the User-Agent
line in a robots.txt entry. It doesn't necessarily map to the robots' User-Agent
HTTP request headers, but it should have a strong resemblance.
{{}}
{{}} {{RFC 9309: Robots Exclusion Protocol" extraName="headline" url="https://www.rfc-editor.org/rfc/rfc9309.html">}} {{}}The product token SHOULD be a substring of the identification string that the crawler sends to the service. For example, in the case of HTTP RFC9110, the product token SHOULD be a substring in the User-Agent header.
{{}}
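The substring relationship the RFC recommends can be checked in a few lines of Python. This is only a sketch: the helper name is hypothetical, the case-insensitive comparison is my assumption about how to apply the recommendation, and the first User-Agent string is one this page lists for PerplexityBot. The Firefox string is an illustrative example of ordinary browser traffic.

```python
def token_in_user_agent(product_token: str, user_agent: str) -> bool:
    """Check whether a robots.txt product token appears in a User-Agent header."""
    return product_token.lower() in user_agent.lower()


PERPLEXITY_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "PerplexityBot/1.0; +https://docs.perplexity.ai/docs/perplexity-bot)"
)
FIREFOX_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:133.0) Gecko/20100101 Firefox/133.0"

print(token_in_user_agent("PerplexityBot", PERPLEXITY_UA))
print(token_in_user_agent("PerplexityBot", FIREFOX_UA))
```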
peer39 crawler
- Product tokens
peer39_crawler
peer39_crawler/1.0
- Reason
- Adtech
- Information
- {{}}{{}}{{}}
- Relevant quote:
By allowing us access, you enable the maximum number of advertisers to confidently purchase advertising space on your pages. Our comprehensive data insights help advertisers understand the suitability and context of your content, ensuring that their ads align with your audience's interests and needs. This alignment leads to improved user experiences, increased engagement, and ultimately, higher revenue potential for your publication.
TurnItInBot
- Product tokens
turnitinbot
- Reason
- Intellectual property snitch.
- Information
- {{}}{{}}{{}}
- Relevant quote:
This robot collects content from the Internet for the sole purpose of helping educational institutions prevent plagiarism. In particular, we compare student papers against the content we find on the Internet to see if we can find similarities.
- Commentary
- TurnItIn also offers false-positive-ridden LLM-detection services, described in {{}}{{}} by {{}}{{}}. This forces students with original papers to adjust their writing styles, distancing themselves from LLM stylometric fingerprints. Non-native English speakers find this especially hard.
AcademicBotRTU
- Product tokens
AcademicBotRTU
- Reason
- Intellectual property snitch.
- Information
- {{}}{{}}{{}}
- Relevant quote:
AcademicBotRTU is a web crawler developed for indexing websites and documents against which to compare and match academic works of students and researchers to help educational and scientific institutions finding and preventing plagiarism.
BLEXBot
- Product tokens
BLEXBot
- Reason
- An intersection of adtech and search engine optimization. The mention of detecting
legal issues
on the page was also concerning.
- Information
- BLEXBot information page (unencrypted, no TLS)
- Archive of BLEXBot information page (with TLS)
- Relevant quote:
BLEXBot assists internet marketers to get information on the link structure of sites and their interlinking on the web, to avoid any technical and possible legal issues and improve overall online experience.
- Commentary
- This one doesn't seem too major, compared to the others. I might have over-reacted.
CheckMark Network's crawler
- Product tokens
CheckMarkNetwork/1.0 (+https://www.checkmarknetwork.com/spider.html)
- Reason
- Intellectual property snitch
- Information
- {{}}{{}}{{}}
- Relevant homepage quotes:
We provide comprehensive trademark monitoring and specialized watch packages in all jurisdictions.
and
The Internet is just way too big to effectively police alone.
- Commentary
- The product token is awful. It should just be
CheckMarkNetwork
, a subset of the User-Agent string excluding variable version indicators and URLs.
BrandVerity
- Product tokens
BrandVerity/1.0
- Reason
- Intellectual property snitch
- Information
- {{}}{{}}{{}} [sic]
- {{}}{{}}{{}}
- Relevant homepage quote:
Maximize CTRs and minimize CPCs by identifying and removing harmful ads from appearing on your most popular branded keywords. Protect your affiliate channel from abuse by policing and taking down unauthorized ads.
PiplBot
- Product tokens
PiplBot
- Reason
- Invasive background checks
- Information
- {{}}{{}}{{}}
- {{}}{{}}{{}}
- Relevant quote:
Pipl assembles online identity information from multiple independent sources to create the most complete picture of a digital identity and connect it to real people and their offline identity records. When all the fragments of online identity data are collected, connected, and corroborated, the result is a more trustworthy identity.
OpenAI GenAI training-data scrapers
- Product tokens
GPTBot
- Reason
- GenAI training-data scraper
- Information
- {{}}{{}}{{}}
- Relevant quote:
GPTBot is used to make our generative AI foundation models more useful and safe. It is used to crawl content that may be used in training our generative AI foundation models. Disallowing GPTBot indicates a site’s content should not be used in training generative AI foundation models.
- Commentary
- The other two bots listed,
OAI-SearchBot
and ChatGPT-User
, aren't used for training LLMs. If you don't want your work to train an OpenAI LLM, blocking them won't accomplish that. It could serve as a form of protest against the company.
Google-Extended
- Product tokens
Google-Extended
- Reason
- GenAI training-data opt-out signal, not an actual bot
- Information
- {{}}{{}}{{}}
- Relevant quote:
Google-Extended
is a standalone product token that web publishers can use to manage whether their sites help improve Gemini Apps and Vertex AI generative APIs, including future generations of models that power those products. Grounding with Google Search on Vertex AI does not use web pages for grounding that have disallowed Google-Extended. Google-Extended does not impact a site's inclusion or ranking in Google Search.
Applebot-Extended
- Product tokens
Applebot-Extended
- Reason
- GenAI training-data opt-out signal, not an actual bot
- Information
- {{}}{{}}{{}}
- Relevant quote:
With Applebot-Extended, web publishers can choose to opt out of their website content being used to train Apple’s foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.
Anthropic's ClaudeBot
- Product tokens
ClaudeBot
- Reason
- GenAI training-data scraper
- Information
- {{}}{{}}{{}}
- Commentary
- Anthropic AI was extremely late in posting robots.txt guidance, after it already scraped plenty of data. It does seem well-behaved now.
Meta GenAI training-data scrapers
- Product tokens
FacebookBot
Meta-ExternalAgent
- Reason
- GenAI training-data scraper.
- Extreme lack of trust in Meta/Facebook.
- Information
- {{}}{{}}{{}}
- {{}}{{}}{{}}
- Relevant quotes:
The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly.
and
FacebookBot crawls public web pages to improve language models for our speech recognition technology.
- Commentary
FacebookBot
used to be the opt-out signal for GenAI data scraping. After hundreds of sites opted out, Meta quietly introduced Meta-ExternalAgent
as their new GenAI data scraping product token and dropped GenAI from their description of FacebookBot
. This effectively required everyone to learn how to opt out once again; in that time, they scraped plenty of sites that had previously opted out. I block both product tokens for good measure.
Cotoyogi
- Product tokens
Cotoyogi
- Reason
- GenAI training-data scraper
- Information
- Bot info page
- Parent page indicates it builds LLMs from the data
Webz.io
- Product tokens
Webzio-extended
- Reason
- GenAI training-data opt-out signal, not an actual bot
- Information
- {{}}{{}}{{}}
- Relevant quote:
This bot takes things a step further. It analyzes your content to see if you’ve indicated it’s forbidden for AI usage, letting big data applications know if the data is allowed for AI purposes. This indicator will be clearly marked on the data itself and reflected in our Terms of Service.
Kangaroo Bot
- Product tokens
Kangaroo Bot
- Reason
- GenAI training-data scraper
- Information
- {{}}{{}}{{}}
Open Web Search GenAI-targeted opt out
- Product tokens
GenAI
- Reason
- GenAI training-data opt-out signal, not an actual bot
- Information
- {{}}{{}}{{}}
- Relevant quote:
we also support the user-agent identifier GenAI, representing any data use for the purposes of training generative AI models...OpenWebSearch.EU forwards any information about the publishers' usage preferences to the users of our web index and all additional data products we publish through an INDEX as well as a GENAI Metadata field, both represented as boolean values.
- Commentary
- I'm actually a fan of the Open Web Search project. It's important to have open indexes of the Web such as the Common Crawl and Open Web Search for projects like upstart search engines, and for research. This is a signal visible to all users of the Open Web Search project's index indicating that I don't want to train GenAI. I very much want Owler to visit my site!
Certain Semrush product tokens
- Product tokens
SemrushBot-OCOB
SemrushBot-FT
- Reason
- Intellectual property snitch.
- LLM-based slop generation tool I do not wish to support.
- Information
- {{}}{{}}{{}}
- Relevant quote:
block SemrushBot-OCOB from crawling your site for ContentShake AI tool
and
block SemrushBot-FT from crawling your site for Plagiarism Checker and similar tools
Velen
- Product tokens
VelenPublicWebCrawler
- Reason
- GenAI training-data scraper
- Information
- {{}}{{}}{{}}
- Relevant quote:
Our goal with this crawler is to build business datasets and machine learning models to better understand the web.
Historical bots I might not block
It's harmless to block these, but I might stop including some of these lest others cargo-cult obsolete entries.
NPBot
- Product tokens
NPBot
- Reason
- Intellectual property snitch.
- Information
- Archived NameProtect bot info page
SlySearch/iThenticate
- Product tokens
SlySearch
- Reason
- A predecessor of TurnItInBot.
- Information
- Archived copy of SlySearch homepage.
Bots I ban with User-Agent rules
These robots either don't respect robots.txt, have a history of disrespecting it, or have a sufficiently bad reputation. I withheld some entries that I block for more personal reasons. I use case-insensitive regular expression rules to ban their traffic on the server/reverse-proxy level.
Be aware that version numbers in User-Agent strings may change.
{{}}{{}}
In my Nginx configs, I set the $blocked_agent
variable on bad user agents. A safe if
directive against $blocked_agent
returns a 403 HTTP error. Here's a tiny sample of how I set $blocked_agent
in the http
context:
{{}}
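# Map each request's User-Agent header to a flag: 1 means "blocked".
map $http_user_agent $blocked_agent {
    default 0;
    # A leading "~*" marks a case-insensitive regular expression.
    ~*(?:ai[0-9]bot|allenai\.org) 1;
    ~*(?:chatglm.spider|chatglm\.cn) 1;
}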
{{}}
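For readers who want to test candidate patterns before deploying them, the matching logic can be approximated in Python. This is a sketch under assumptions, not my server setup: it reuses the two sample patterns and two User-Agent strings listed on this page, treats them as unanchored case-insensitive searches, and uses a hypothetical `status_for` helper plus an illustrative Firefox string.

```python
import re

# The two sample filters, combined into one case-insensitive pattern.
BLOCKED_AGENT_PATTERNS = [
    r"ai[0-9]bot|allenai\.org",
    r"chatglm.spider|chatglm\.cn",
]
BLOCKED_AGENT = re.compile(
    "|".join(f"(?:{pattern})" for pattern in BLOCKED_AGENT_PATTERNS),
    re.IGNORECASE,
)


def status_for(user_agent: str) -> int:
    """Return 403 for a blocked User-Agent string, otherwise 200."""
    return 403 if BLOCKED_AGENT.search(user_agent) else 200


print(status_for("Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)"))
print(status_for("Mozilla/5.0 (compatible; ChatGLM-Spider/1.0; +https://chatglm.cn/)"))
print(status_for("Mozilla/5.0 (X11; Linux x86_64; rv:133.0) Gecko/20100101 Firefox/133.0"))
```

Running real browser User-Agent strings through a check like this is a cheap way to catch an over-broad pattern before it turns away human visitors.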
ChatGLM
- User-Agent strings
- Mozilla/5.0 (compatible; ChatGLM-Spider/1.0; +https://chatglm.cn/)
- Information
- Official ChatGLM website
- Regex filter
chatglm.spider|chatglm\.cn
- Commentary
- Only hit my site twice. No documentation on robots.txt compliance.
Ai2Bot
- User-Agent strings
- Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler)
- Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)
- Information
- {{}}{{}}{{}}
- Relevant quotes:
The AI2 Bot explores certain domains to find web content. This web content is used to train open language models
andThis user agent string can be used to filter or reject traffic from our crawler if desired.
- Regex filter
ai[0-9]bot|allenai\.org
- Commentary
- This robot used a slightly different User-Agent string to reach my site than the one advertised on the crawling notice (I saw a -Dolma suffix in my logs).
FriendlyCrawler
- User-Agent strings
- Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/605.1.15 (KHTML, like Gecko; compatible; FriendlyCrawler/1.0) Chrome/120.0.6099.216 Safari/605.1.15
- Information
- {{}}{{}} by {{}}{{}} has good coverage of this bot.
PerplexityBot
- User-Agent strings
- Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://docs.perplexity.ai/docs/perplexity-bot)
- Product tokens
PerplexityBot
- Information
- {{}}{{}}{{}}
- Commentary
- {{}}{{}} wrote {{}}{{}}, disobeying robots.txt in the process too. {{}}WIRED confirmed these findings in {{}}{{}}.
Perplexity doesn't train new LLMs, but it has lost enough of my trust.
Exceptions: scrapers I allow, despite meeting some block-criteria
AdsBot
Advertising-averse webmasters often block Google's AdsBot and its cousins. I allow it, but place the following well-known resources in my site's root:
/ads.txt
/app-ads.txt
(possibly redundant)
These files contain the list of authorized ad sellers for a domain. I use a standard placeholder to declare that there exist no authorized ad sellers for seirdy.one
. Any client-side malware that injects ads into my site won't make money via ad networks adhering to this specification.
{{}}{{}}
These are the contents of my /ads.txt
and /app-ads.txt
.
{{}}
placeholder.example.com, placeholder, DIRECT, placeholder
{{}}
You can read more in the ads.txt specification on {{}}{{}} by the IAB Tech Lab{{}}. I learned about this technique via an article from GrapheneOS describing their use of /ads.txt
.
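To show how a consumer of this file might read the placeholder record, here is a small Python sketch. It assumes the comma-separated record layout of ads.txt (advertising system domain, publisher account ID, relationship type, optional certification authority ID). The class, field, and function names are my own labels rather than anything the specification defines.

```python
from typing import NamedTuple, Optional


class AdsTxtRecord(NamedTuple):
    domain: str
    publisher_id: str
    relationship: str
    cert_authority_id: Optional[str]


def parse_ads_txt_line(line: str) -> Optional[AdsTxtRecord]:
    """Parse one ads.txt data record; return None for blanks, comments, and non-records."""
    content = line.split("#", 1)[0].strip()  # drop trailing comments
    if not content:
        return None
    fields = [field.strip() for field in content.split(",")]
    if len(fields) < 3:
        return None
    cert = fields[3] if len(fields) > 3 else None
    return AdsTxtRecord(fields[0], fields[1], fields[2].upper(), cert)


record = parse_ads_txt_line("placeholder.example.com, placeholder, DIRECT, placeholder")
print(record)
```

Because the only record names a placeholder domain, no real ad system appears as an authorized seller for the site.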
CCBot and Owler
I allow CCBot, the robot powering Common Crawl. Common Crawl crawls the web to build a public index that anybody can use. People often use it for research; the Web Data Commons is my favorite example. People also use it to build new search engines, since building an index is one of the hardest parts of starting one: [Alexandria and initial versions of Stract]({{<relref "/posts/search-engines-with-own-indexes/#smaller-indexes-or-less-relevant-results">}}) use the Common Crawl as their index (Stract has since switched to its own crawler).
Ever since Google used Common Crawl to power initial versions of Bard's LLM (later renamed to Gemini, Google's brand name shared across all its generative-AI products), many sites started to block CCBot in their robots.txt file. I understand why: an open index for anybody to use also invites use for LLM training, so sites blocking LLM training-data scrapers would want to block CCBot. I personally find the non-GenAI utility of Common Crawl too large to ignore.
The Open Web Search Project shares similar goals with Common Crawl, so I allow their "Owler" crawler for similar reasons. I block their GenAI
product token only to add metadata to their data set indicating to users that I'm opting out of GenAI training.
Some SEO crawlers: MJ12bot, AhrefsBot, etc.
I frown on most Search Engine Optimization (SEO) services, given how they've contributed to turning the most reachable Web pages into "content" that optimizes for ranking well on search algorithms at the cost of having genuine, useful writing that shows a love for a topic. I call this phenomenon [the "Living Dead Internet Theory"]({{<relref "/notes/living-dead-internet-theory/">}}). I still permit most SEO crawlers, though.
Many of them have other uses. MJ12bot powers Majestic's web index, which Majestic uses for historical exploration, a (now offline?) search engine, and The Majestic Million. The Majestic Million also partially powers Tranco, which in turn supports a large body of research. Ahrefs is another SEO company that uses AhrefsBot to find SEO backlinks, but also started using its index for the Yep.com search engine.
For the ones that don't have other uses, I don't have an issue with backlink-finding. I have issues with other SEO tools, such as keyword research and analytics to optimize "content" for engagement rather than to write about what an author feels like sharing.2 My understanding is that crawling mostly powers backlink-finding.
LLM clients
GenAI companies use these bots to power LLM-based tools. They don't train GenAI but function as clients of LLMs to e.g. fetch search results. These include:
- cohere-ai
- OpenAI's OAI-SearchBot, ChatGPT-User
- YouBot of You.com. I am not sure about this one.
To-do
I need to investigate Diffbot. It fetches data for a commercial dataset useful for LLM training; it also powers article extraction for services such as Instapaper. I need to weigh the range of its uses as I did for CCBot before I make a decision.
I should invest more time thinking about bots that don't fit my current criteria but aren't well-behaved (too many requests, not obeying robot directives such as X-Robots
or robots.txt). Even if my site can weather their load, I should still use this page to alert others about them if possible.
Conclusion
I hope this helps more webmasters learn not just about certain bots worth blocking, but also about the nuance involved. I split things across robots.txt, X-Robots
, and banned User-Agent strings in my server configs. I allow certain often-blocked bots to slip through.
Acknowledgements and other resources
I found most bots via my server logs, but some third party collections proved helpful.
- Dark Visitors
- I don't recommend using this as an authoritative source, as it's not terribly detailed and invests little nuance in each entry's explanation. Several inaccurate entries and false-positives exist. Still, Dark Visitors is a useful resource for anybody willing to fact-check what they find.
- ai.robots.txt
- Like Dark Visitors, this is a repository of information about GenAI-related robots with some inaccuracies and false-positives. Fact-check any findings before accepting them.
- Matomo's Device Detector repository
- Browsing Device Detector's
bots.yml
using yq
helped me learn about a few entries.
- Cloudflare Verified Bots
- I don't recall actually discovering a bot to block via Cloudflare Radar, but I'm linking it here as it's a helpful bot catalog in general. Be aware of errors in its bot categorization. Several bots listed under "AI Crawler" aren't actually training or even related to GenAI. Some "Search Engine Crawler" entries are non-crawling scrapers that only fetch favicons for search results. This is a good starting point for further research if you fact-check entries yourself.
- Helpful individuals
- Thanks to {{}} for telling me about Open Web Search adding a
GenAI
product token for opting out of GenAI training, and about Semrush's array of product tokens.
-
I also got some secondhand embarrassment from some sites that copied my robots.txt file verbatim, even preserving my sitemap URL at the end! ↩︎
-
I know "not all analytics" optimize engagement. Not all analytics are problematic for the same reason; different kinds of analytics are problematic for different reasons. ↩︎