---
title: "Scrapers I block (and allow), with explanations"
description: "Here’s my thought process when deciding whether to block a scraper from seirdy.one, the scrapers I block, the scrapers I allow, and the ways I block them."
date: 2024-12-10T00:37:54-05:00
tags:
  - web
  - AI
outputs:
  - html
syndicatedCopies:
  - title: 'The Fediverse'
    url: 'https://pleroma.envs.net/objects/c022c39f-5cab-49c8-bf84-7504326e87dc'
  - title: 'BlueSky'
    url: 'https://bsky.app/profile/seirdy.one/post/3lcwjwqkc5k2z'
  - title: 'jstpst'
    url: 'https://jstpst.net/f/articles/12091/scrapers-i-block-and-allow-with-explanations'
---
## Introduction

[An old version of my robots.txt file](https://git.sr.ht/~seirdy/seirdy.one/blob/b9536a6a9d592d644d53a65b952bd7c95a0faa04/static/robots.txt) was full of documentation on each user-agent I blocked. Over time, other websites started using my robots.txt file as a reference. I'm humbled by how useful others find my robots.txt file.[^1]

Unfortunately, a plain text file in a browser window isn't the most readable experience for everybody. It has no headings for quick navigation, inconsistent support for soft line-wrapping, and poor support for hyperlinks.

I'm compiling documentation on the bots I block here, including my thought process for deciding whether to block or allow a bot. This isn't an alternative to larger resources on bad-bot lists, but a secondary resource for them to cite. If you compile a larger list of bots, feel free to cite this page.

At the same time, I hope to make this an alternative to generic bad-bot lists with more nuance. I even go so far as explaining why I _allow_ certain commonly-blocked bots.

[Skip to the list of bots I block](#bots-blocked) if that's all you're here for.
## Criteria for bad-bot blocking

Bots I block fall into one of the following categories:

- Bots that only serve to power adtech on _other_ sites. My site has no ads, but [I allow bots such as Google's AdsBot](#adsbot).
- Intellectual property snitches. I forbid robots that scan for plagiarism, trademark/copyright violations, brand protection, etc.
- Robots that power invasive background checks that border on cyberstalking.
- Scrapers that build datasets to train Generative AI (GenAI), such as large language models (LLMs). I don't block search clients used by GenAI research assistants; I only block scrapers used to _train_ GenAI models.

### On contradictions between intellectual-property snitches and GenAI scrapers

Some people think opposing GenAI scraping is support for intellectual property, and opposing intellectual property means support for GenAI scraping. My distaste for modern intellectual property protections comes from seeing how they stifle creativity. My view on the relationship between IP and GenAI scrapers is that much of (not all of) IP law _and_ GenAI are bad for creators. Policies on scraping, such as parts of the EU "CDSM Directive" relevant to the Text and Data Mining Protocol (discussed later), have the potential to strike the right balance between them on the Web.

I welcome neither copyright snitches that work against remix culture nor GenAI scrapers that devalue and replace creative labor (remix or not) on seirdy.one.

## How I block bots

Here's how I block bots, from the most polite approaches to the most severe restrictions.

### X-Robots

`X-Robots` tags control _what to do with a page after it has been crawled._ They do _not_ opt out of mere crawling: they opt out of processing pages after they've been discovered by a crawler.

I set `X-Robots` tags in every page that forbid training Generative AI algorithms but accept just about everything else.
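
To make the "processing, not crawling" distinction concrete, here is a minimal Python sketch of how a well-behaved dataset builder might consult these directives before using a page it has already fetched. It assumes the directives arrive as the value of an `X-Robots-Tag` HTTP response header and that the opt-out is the `noai` directive discussed below. The function names are hypothetical, and the parsing is deliberately simplified: it handles only a plain comma-separated list, not per-crawler prefixes or directives that carry values.

```python
def parse_directives(header_value: str) -> set[str]:
    """Split a comma-separated X-Robots-Tag value into lowercase directives."""
    return {part.strip().lower() for part in header_value.split(",") if part.strip()}


def may_train_genai(header_value: str) -> bool:
    """Return False when the page opts out of GenAI training via `noai`."""
    return "noai" not in parse_directives(header_value)


# The page has already been crawled; the directive governs what happens next.
print(may_train_genai("noai"))           # page opted out of GenAI training
print(may_train_genai("index, follow"))  # no GenAI opt-out present
```

Nothing in this check stops the crawl itself. It only tells a cooperating processor what it may do with the page afterwards.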
[Google Search Central lists its supported directives](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag).

{{}}DeviantArt popularized the `NoAI` `X-Robots` tag in {{}}{{}}, which [Cohost](https://web.archive.org/web/20241207040446/https://cohost.org/staff/post/272195-cohost-now-sets-devi) and [Misskey](https://github.com/misskey-dev/misskey/pull/10833) have since implemented. The [img2dataset scraper](https://github.com/rom1504/img2dataset/pull/218) respects it.

### robots.txt

robots.txt is meant to opt out of crawling, to reduce server load. It does _not_ opt you out of further processing of crawled pages. Data miners can still fetch your pages without crawling them: they can fetch archived snapshots, use data collection in users' browsers or browser extensions, download or buy datasets, etc. `X-Robots` tags are the only standard vendor-neutral format for opting out of processing of crawled pages.

**There is no way to opt out of both crawling and further processing.** A page's policies on permitted forms of processing can only be found by visiting the page and reading those policies in `X-Robots` tags.

That said, many robots do not respect `X-Robots` tags, or an `X-Robots` tag doesn't exist to prevent the type of processing I want to prevent (e.g. IP-snitching). Often, a robots.txt entry is the only documented way to opt out of something. In these cases, I compromise and hijack my robots.txt to opt out.

### Comparing robots.txt, X-Robots, and the TDM Reservation Protocol

Ideally, `X-Robots` would support per-engine rules in a standard format, such as robots.txt. Unfortunately, it doesn't. Unlike robots.txt, it has per-page rules that persist on archives of a given page, potentially saved in other data sets.

A successor to `X-Robots` called the [Text and Data Mining (TDM) Reservation Protocol](https://w3c.github.io/tdm-reservation-protocol/) may close the gap between `X-Robots` and robots.txt.
It allows setting policies on data-mining to forbid LLM training, among other things.

The `noai` `X-Robots` directive makes the same mistake as the "Do Not Track" HTTP request header, in that it's a vague polite request with no incentive for compliance. Crawlers generally respect robots.txt because they get caught and banned if they don't; indexers respect `noindex` because it prevents duplicate, internal, and shallow pages from getting indexed. There's no incentive to respect `noai`.

The TDM Reservation Protocol is less like "Do Not Track" and more like Global Privacy Control: observance of the TDM Reservation Protocol could be a way to qualify for an exception under the European Union's "CDSM Directive". It could have a clear legal meaning and legal incentive, which gives companies a very good reason to respect it. After the TDM Reservation Protocol finalizes and is shown to work, I may adopt it.

### User-Agent regex denial

I block certain bots by matching their User-Agent strings against a regular expression and serving them a "403: Forbidden" HTTP error. I don't want to welcome these robots on seirdy.one. They typically don't obey (or have a history of disobeying) `X-Robots` or robots.txt.

**This is a last resort.** I _need_ well-behaved robots to read my `X-Robots` tags and robots.txt so they can know that I'm opting out of data collection. By doing so, they can also skip my site when scraping third-party datasets and drop existing copies of my pages from their datasets.

## Currently-relevant bots I block in robots.txt {#bots-blocked}

Here's the list of bots, their product tokens, and information on the bots. A product token populates the `User-Agent` line in a robots.txt entry. It doesn't necessarily map to the robots' `User-Agent` HTTP request headers, but it should have a strong resemblance.

{{}}
The product token **SHOULD** be a substring of the identification string that the crawler sends to the service. For example, in the case of HTTP RFC9110, the product token **SHOULD** be a substring in the User-Agent header.
[RFC 9309: Robots Exclusion Protocol](https://www.rfc-editor.org/rfc/rfc9309.html)

### peer39 crawler

Product tokens
: `peer39_crawler`
: `peer39_crawler/1.0`

Reason
: Adtech

Information
: {{}}{{}}{{}}
: Relevant quote: "By allowing us access, you enable the maximum number of advertisers to confidently purchase advertising space on your pages. Our comprehensive data insights help advertisers understand the suitability and context of your content, ensuring that their ads align with your audience's interests and needs. This alignment leads to improved user experiences, increased engagement, and ultimately, higher revenue potential for your publication."

### TurnItInBot

Product tokens
: `turnitinbot`

Reason
: Intellectual property snitch.

Information
: {{}}{{}}{{}}
: Relevant quote: "This robot collects content from the Internet for the sole purpose of helping educational institutions prevent plagiarism. In particular, we compare student papers against the content we find on the Internet to see if we can find similarities."

Commentary
: TurnItIn also offers false-positive-ridden LLM-detection services, described in {{}}{{}} by {{}}. This forces students with original papers to adjust their writing styles, distancing themselves from LLM stylometric fingerprints. Non-native English speakers find this especially hard.

### BLEXBot

Product tokens
: `BLEXBot`

Reason
: An intersection of adtech and search engine optimization. The mention of detecting legal issues on the page was also concerning.

Information
: [BLEXBot information page (unencrypted, no TLS)](http://webmeup-crawler.com/)
: [Archive of BLEXBot information page (with TLS)](https://web.archive.org/web/20241112175427/http://webmeup-crawler.com/)
: Relevant quote: "BLEXBot assists internet marketers to get information on the link structure of sites and their interlinking on the web, to avoid any technical and possible legal issues and improve overall online experience."

Commentary
: This one doesn't seem too major, compared to the others. I might have overreacted.
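
As a rough illustration of how product tokens like the three above end up in a robots.txt group, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt text is an illustrative stand-in rather than a copy of my real file, and `SomeOtherBot`, `build_parser`, `is_allowed`, and the example URL are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt, not a copy of the real seirdy.one file.
ROBOTS_TXT = """\
User-agent: peer39_crawler
User-agent: turnitinbot
User-agent: BLEXBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, product_token: str,
               url: str = "https://example.com/posts/") -> bool:
    """Ask whether a crawler identifying with this product token may fetch url."""
    return parser.can_fetch(product_token, url)


parser = build_parser(ROBOTS_TXT)
for token in ("peer39_crawler", "turnitinbot", "BLEXBot", "SomeOtherBot"):
    print(token, is_allowed(parser, token))
```

The check is keyed on the product token rather than the full `User-Agent` header, which is why each entry in this list documents its token separately.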
### CheckMark Network's crawler

Product tokens
: `CheckMarkNetwork/1.0 (+https://www.checkmarknetwork.com/spider.html)`

Reason
: Intellectual property snitch

Information
: {{}}{{}}{{}}
: Relevant homepage quotes: "We provide comprehensive trademark monitoring and specialized watch packages in all jurisdictions." and "The Internet is just way too big to effectively police alone."

Commentary
: The product token is awful. It should just be `CheckMarkNetwork`, a substring of the User-Agent string excluding variable version indicators and URLs.

### BrandVerity

Product tokens
: `BrandVerity/1.0`

Reason
: Intellectual property snitch

Information
: {{}}{{}}{{}} \[sic\]
: {{}}{{}}{{}}
: Relevant homepage quote: "Maximize CTRs and minimize CPCs by identifying and removing harmful ads from appearing on your most popular branded keywords. Protect your affiliate channel from abuse by policing and taking down unauthorized ads."

### PiplBot

Product tokens
: `PiplBot`

Reason
: Invasive background checks

Information
: {{}}{{}}{{}}
: {{}}{{}}{{}}
: Relevant quote: "Pipl assembles online identity information from multiple independent sources to create the most complete picture of a digital identity and connect it to real people and their offline identity records. When all the fragments of online identity data are collected, connected, and corroborated, the result is a more trustworthy identity."

### OpenAI GenAI training-data scrapers

Product tokens
: `GPTBot`

Reason
: GenAI training-data scraper

Information
: {{}}{{}}{{}}
: Relevant quote: "GPTBot is used to make our generative AI foundation models more useful and safe. It is used to crawl content that may be used in training our generative AI foundation models. Disallowing GPTBot indicates a site’s content should not be used in training generative AI foundation models."

Commentary
: The other two bots listed, `OAI-SearchBot` and `ChatGPT-User`, aren't used for training LLMs.
If you don't want your work to train an OpenAI LLM, blocking them won't accomplish that. It could serve as a form of protest against the company.

### Google-Extended

Product tokens
: `Google-Extended`

Reason
: GenAI training-data opt-out signal, not an actual bot

Information
: {{}}{{}}{{}}
: Relevant quote: "`Google-Extended` is a standalone product token that web publishers can use to manage whether their sites help improve Gemini Apps and Vertex AI generative APIs, including future generations of models that power those products. Grounding with Google Search on Vertex AI does not use web pages for grounding that have disallowed Google-Extended. Google-Extended does not impact a site's inclusion or ranking in Google Search."

### Applebot-Extended

Product tokens
: `Applebot-Extended`

Reason
: GenAI training-data opt-out signal, not an actual bot

Information
: {{}}{{}}{{}}
: Relevant quote: "With Applebot-Extended, web publishers can choose to opt out of their website content being used to train Apple’s foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools."

### Anthropic's ClaudeBot

Product tokens
: `ClaudeBot`

Reason
: GenAI training-data scraper

Information
: {{}}{{}}{{}}

Commentary
: Anthropic AI was _extremely late_ in posting robots.txt guidance, after it already scraped plenty of data. It does seem well-behaved now.

### Meta GenAI training-data scrapers

Product tokens
: `FacebookBot`
: `Meta-ExternalAgent`

Reason
: GenAI training-data scraper.
: Extreme lack of trust in Meta/Facebook.

Information
: {{}}{{}}{{}}
: {{}}{{}}{{}}
: Relevant quotes: "The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly." and "FacebookBot crawls public web pages to improve language models for our speech recognition technology."

Commentary
: `FacebookBot` used to be the opt-out signal for GenAI data scraping.
After hundreds of sites opted out, Meta quietly introduced `Meta-ExternalAgent` as their new GenAI data scraping product token and dropped GenAI from their description of `FacebookBot`. This effectively required everyone to learn how to opt out once again; in that time, they scraped plenty of sites that had previously opted out. I block both product tokens for good measure.

### Cotoyogi

Product tokens
: `Cotoyogi`

Reason
: GenAI training-data scraper

Information
: [Bot info page](https://ds.rois.ac.jp/center8/crawler/)
: [Parent page indicates it builds LLMs from the data](https://ds.rois.ac.jp/center8/)

### Webz.io

Product tokens
: `Webzio-extended`

Reason
: GenAI training-data opt-out signal, not an actual bot

Information
: {{}}{{}}{{}}
: Relevant quote: "This bot takes things a step further. It analyzes your content to see if you’ve indicated it’s forbidden for AI usage, letting big data applications know if the data is allowed for AI purposes. This indicator will be clearly marked on the data itself and reflected in our Terms of Service."

### Kangaroo Bot

Product tokens
: `Kangaroo Bot`

Reason
: GenAI training-data scraper

Information
: {{}}{{}}{{}}

### Open Web Search GenAI-targeted opt-out

Product tokens
: `GenAI`

Reason
: GenAI training-data opt-out signal, not an actual bot

Information
: {{}}{{}}{{}}
: Relevant quote: "we also support the user-agent identifier GenAI, representing any data use for the purposes of training generative AI models...OpenWebSearch.EU forwards any information about the publishers' usage preferences to the users of our web index and all additional data products we publish through an INDEX as well as a GENAI Metadata field, both represented as boolean values."

Commentary
: I'm actually a fan of the Open Web Search project. It's important to have open indexes of the Web such as the Common Crawl and Open Web Search for projects like upstart search engines, and for research.
Blocking the `GenAI` token is a signal, visible to all users of the Open Web Search project's index, that I don't want my pages used to train GenAI. I very much want Owler to visit my site!

### Certain Semrush product tokens

Product tokens
: `SemrushBot-OCOB`
: `SemrushBot-FT`

Reason
: Intellectual property snitch.
: LLM-based slop generation tool I do not wish to support.

Information
: {{}}{{}}{{}}
: Relevant quote: "block SemrushBot-OCOB from crawling your site for ContentShake AI tool and block SemrushBot-FT from crawling your site for Plagiarism Checker and similar tools"

### Velen

Product tokens
: `VelenPublicWebCrawler`

Reason
: GenAI training-data scraper

Information
: {{}}{{}}{{}}
: Relevant quote: "Our goal with this crawler is to build business datasets and machine learning models to better understand the web."

## Historical bots I might not block

It's harmless to block these, but I might stop including some of these lest others cargo-cult obsolete entries.

### NPBot

Product tokens
: `NPBot`

Reason
: Intellectual property snitch.

Information
: [Archived NameProtect bot info page](https://web.archive.org/web/20031008144550/http://nameprotect.com/botinfo.html)

### SlySearch/iThenticate

Product tokens
: `SlySearch`

Reason
: A predecessor of TurnItInBot.

Information
: [Archived copy of SlySearch homepage](https://web.archive.org/web/20060207051223/http://www.slysearch.com:80/static/home.html).

## Bots I ban with User-Agent rules

These robots either don't respect robots.txt, have a history of disrespecting it, or have a sufficiently bad reputation. I withheld some entries that I block for more personal reasons.

I use case-insensitive regular expression rules to ban their traffic on the server/reverse-proxy level. Be aware that version numbers in User-Agent strings may change.

{{}}{{}} In my Nginx configs, I set the `$blocked_agent` variable on bad user agents. A safe `if` directive against `$blocked_agent` returns a 403 HTTP error.
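
The matching logic itself is simple enough to prototype outside Nginx before committing a rule. What follows is a minimal Python sketch, not my server configuration: it reuses two of the regex filters listed later in this section, while the `status_for` helper and the sample browser User-Agent are hypothetical.

```python
import re

# Two regex filters taken from the entries later in this section.
BLOCKED_AGENT_PATTERNS = [
    r"ai[0-9]bot|allenai\.org",
    r"chatglm.spider|chatglm\.cn",
]

# Case-insensitive and unanchored, so a match anywhere in the header counts.
BLOCKED_AGENT = re.compile(
    "|".join(f"(?:{pattern})" for pattern in BLOCKED_AGENT_PATTERNS),
    re.IGNORECASE,
)


def status_for(user_agent: str) -> int:
    """Return 403 for a banned User-Agent string and 200 otherwise."""
    return 403 if BLOCKED_AGENT.search(user_agent) else 200


samples = [
    "Mozilla/5.0 (compatible; ChatGLM-Spider/1.0; +https://chatglm.cn/)",
    "Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)",
    # Hypothetical ordinary browser, included as a negative case.
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
]
for user_agent in samples:
    print(status_for(user_agent), user_agent)
```

Running candidate filters against real User-Agent strings from server logs is a cheap way to catch a pattern that is too broad before it starts rejecting legitimate visitors.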
Here's a tiny sample of how I set `$blocked_agent` in the `http` context:

{{}}

```figure
map $http_user_agent $blocked_agent {
	# User agents that match nothing below are not blocked.
	default 0;
	# "~*" marks a case-insensitive regular expression.
	~*(?:ai[0-9]bot|allenai\.org) 1;
	~*(?:chatglm.spider|chatglm\.cn) 1;
}
```

{{}}

### ChatGLM

User-Agent strings
: `Mozilla/5.0 (compatible; ChatGLM-Spider/1.0; +https://chatglm.cn/)`

Information
: [Official ChatGLM website](https://chatglm.cn/)

Regex filter
: `chatglm.spider|chatglm\.cn`

Commentary
: Only hit my site twice. No documentation on robots.txt compliance.

### Ai2Bot

User-Agent strings
: `Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler)`
: `Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)`

Information
: {{}}{{}}{{}}
: Relevant quotes: "The AI2 Bot explores certain domains to find web content. This web content is used to train open language models" and "This user agent string can be used to filter or reject traffic from our crawler if desired."

Regex filter
: `ai[0-9]bot|allenai\.org`

Commentary
: This robot used a _slightly_ different User-Agent string to reach my site than the one advertised on the crawling notice (I saw a `-Dolma` suffix in my logs).

### FriendlyCrawler

User-Agent strings
: `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/605.1.15 (KHTML, like Gecko; compatible; FriendlyCrawler/1.0) Chrome/120.0.6099.216 Safari/605.1.15`

Information
: {{}}{{}} by {{}} has good coverage of this bot.

### PerplexityBot

User-Agent strings
: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://docs.perplexity.ai/docs/perplexity-bot)`

Product tokens
: `PerplexityBot`

Information
: {{}}{{}}{{}}

Commentary
: {{}}{{}}, disobeying robots.txt in the process too. {{}}WIRED confirmed these findings in {{}}{{}}. Perplexity doesn't train new LLMs, but it has lost enough of my trust.

## Exceptions: scrapers I allow, despite meeting some block-criteria

### AdsBot

Advertising-averse webmasters often block Google's AdsBot and its cousins.
I allow it, but place the following [well-known resources](https://en.wikipedia.org/wiki/Well-known_URI) in my site's root:

- `/ads.txt`
- `/app-ads.txt` (possibly redundant)

These files contain the list of authorized ad sellers for a domain. I use a standard placeholder to declare that there exist _no_ authorized ad sellers for `seirdy.one`. Any client-side malware that injects ads into my site won't make money via ad networks adhering to this specification.

{{}}{{}} These are the contents of my `/ads.txt` and `/app-ads.txt`.

{{}}

```figure
placeholder.example.com, placeholder, DIRECT, placeholder
```

{{}}

You can read more in the ads.txt specification on {{}}{{}} by the IAB Tech Lab{{}}. I learned about this technique via [an article from GrapheneOS describing their use of `/ads.txt`](https://grapheneos.org/articles/sitewide-advertising-industry-opt-out).

### CCBot and Owler

I allow CCBot, the robot powering [Common Crawl](http://commoncrawl.org/). Common Crawl crawls the web to build a public index that anybody can use. People often use it for research; the [Web Data Commons](https://webdatacommons.org/) is my favorite example. People also use it to build new search engines, since building an index is one of the hardest parts of starting one: [Alexandria and initial versions of Stract]({{}}) use the Common Crawl as their index (Stract has since switched to its own crawler).

Ever since Google used Common Crawl to power initial versions of Bard's LLM (later renamed to Gemini, Google's brand name shared across all its generative-AI products), many sites started to block CCBot in their robots.txt file. I understand why: an open index for anybody to use also invites use for LLM training, so sites blocking LLM training-data scrapers would want to block CCBot. I personally find the non-GenAI utility of Common Crawl too large to ignore.
The Open Web Search Project [shares similar goals with Common Crawl](https://openwebsearch.eu/common-goals-with-common-crawl/), so I allow [their "Owler" crawler](https://openwebsearch.eu/owler/) for similar reasons. I block their `GenAI` product token only to add metadata to their data set indicating to users that I'm opting out of GenAI training.

### Some SEO crawlers: MJ12bot, AhrefsBot, etc.

I frown on most Search Engine Optimization (SEO) services, given how they've contributed to turning the most reachable Web pages into "content" that optimizes for ranking well on search algorithms at the cost of having genuine, useful writing that shows a love for a topic. I call this phenomenon [the "Living Dead Internet Theory"]({{}}).

I still permit most SEO crawlers, though. Many of them have other uses. MJ12bot powers Majestic's web index, which Majestic uses for [historical exploration](https://blog.majestic.com/company/new-15-year-plus-historic-index/), a (now offline?) search engine, and [The Majestic Million](https://majestic.com/reports/majestic-million). The Majestic Million also partially powers [Tranco](https://tranco-list.eu/), which in turn supports a large body of research. Ahrefs is another SEO company that uses AhrefsBot to find SEO backlinks, but also started using its index for the Yep.com search engine.

For the ones that don't have other uses, **I don't have an issue with backlink-finding. I have issues with other SEO tools,** such as keyword research and analytics to optimize "content" for engagement rather than to write about what an author feels like sharing.[^2] My understanding is that crawling mostly powers backlink-finding.

### LLM clients

GenAI companies use these bots to power LLM-based tools. They don't _train_ GenAI but function as clients _of_ LLMs to e.g. fetch search results. These include:

- `cohere-ai`
- OpenAI's `OAI-SearchBot`, `ChatGPT-User`
- `YouBot` of You.com. I am not sure about this one.

## To-do

I need to investigate Diffbot.
It fetches data for a commercial dataset useful for LLM training; it also powers article extraction for services such as Instapaper. I need to weigh the range of its uses as I did for CCBot before I make a decision. I should invest more time thinking about bots that don't fit my current criteria but aren't well-behaved (too many requests, not obeying robot directives such as `X-Robots` or robots.txt). Even if my site can weather their load, I should still use this page to alert others about them if possible.
## Conclusion

I hope this helps more webmasters learn not just about certain bots worth blocking, but also about the nuance involved. I split things across robots.txt, `X-Robots`, and banned User-Agent strings in my server configs. I allow certain often-blocked bots to slip through.
## Ack­nowledge­ments and other resources {#acknowledgements}

I found most bots via my server logs, but some third-party collections proved helpful.

[Dark Visitors](https://darkvisitors.com/)
: I don't recommend using this as an authoritative source, as it's not terribly detailed and invests little nuance in each entry's explanation. Several inaccurate entries and false-positives exist. Still, Dark Visitors is a useful resource for anybody willing to fact-check what they find.

[ai.robots.txt](https://github.com/ai-robots-txt/ai.robots.txt)
: Like Dark Visitors, this is a repository of information about GenAI-related robots with some inaccuracies and false-positives. Fact-check any findings before accepting them.

[Matomo's Device Detector repository](https://github.com/matomo-org/device-detector)
: Browsing Device Detector's `bots.yml` using `yq` helped me learn about a few entries.

[Cloudflare Verified Bots](https://radar.cloudflare.com/traffic/verified-bots)
: I don't recall actually discovering a bot to block via Cloudflare Radar, but I'm linking it here as it's a helpful bot catalog in general. Be aware of errors in its bot categorization. Several bots listed under "AI Crawler" aren't actually training or even related to GenAI. Some "Search Engine Crawler" entries are non-crawling scrapers that only fetch favicons for search results. This is a good starting point for further research if you fact-check entries yourself.

Helpful individuals
: Thanks to {{}} for telling me about Open Web Search adding a `GenAI` product token for opting out of GenAI training, and about SemRush's array of product tokens.
[^1]: I also got some secondhand embarrassment from some sites that copied my robots.txt file verbatim, even preserving _my_ sitemap URL at the end!

[^2]: I know "not all analytics" optimize engagement. Not all analytics are problematic for the same reason; different kinds of analytics are problematic for different reasons.