Scrapers I block (and allow), with explanations

Here's my thought process when deciding whether to block a scraper from seirdy.one, the scrapers I block, the scrapers I allow, and the ways I block them.

Published 2024-12-10T00:37:54-05:00. Tags: web, AI.

Syndicated copies:

  • The Fediverse: https://pleroma.envs.net/objects/c022c39f-5cab-49c8-bf84-7504326e87dc
  • BlueSky: https://bsky.app/profile/seirdy.one/post/3lcwjwqkc5k2z
  • jstpst: https://jstpst.net/f/articles/12091/scrapers-i-block-and-allow-with-explanations

Introduction

An old version of my robots.txt file was full of documentation on each user-agent I blocked. Over time, other websites started using my robots.txt file as a reference. I'm humbled by how useful others find my robots.txt file.1

Unfortunately, a plain text file in a browser window isn't the most readable experience for everybody. It has no headings for quick navigation, inconsistent support for soft line-wrapping, and poor support for hyperlinks.

I'm compiling documentation on the bots I block here, including my thought process for deciding whether to block or allow a bot. This isn't an alternative to larger resources on bad-bot lists, but a secondary resource for them to cite. If you compile a larger list of bots, feel free to cite this page.

At the same time, I hope to make this a more nuanced alternative to generic bad-bot lists. I even go so far as to explain why I allow certain commonly-blocked bots.

Skip to the list of bots I block if that's all you're here for.

Criteria for bad-bot blocking

Bots I block fall into one of the following categories:

  • Bots that only serve to power adtech on other sites. My site has no ads, but I allow bots such as Google's AdsBot.

  • Intellectual property snitches. I forbid robots that scan for plagiarism, trademark/copyright violations, brand protection, etc.

  • Robots that power invasive background checks that border on cyberstalking.

  • Scrapers that build datasets to train Generative AI (GenAI), such as large language models (LLMs). I don't block search clients used by GenAI research assistants; I only block scrapers used to train GenAI models.

On contradictions between intellectual-property snitches and GenAI scrapers

Some people think that opposing GenAI scraping amounts to supporting intellectual property, and that opposing intellectual property amounts to supporting GenAI scraping.

My distaste for modern intellectual property protections comes from seeing how they stifle creativity. My view on the relationship between IP and GenAI scrapers is that much (though not all) of IP law and GenAI are both bad for creators. Policies on scraping, such as the parts of the EU "CDSM Directive" relevant to the Text and Data Mining Reservation Protocol (discussed later), have the potential to strike the right balance between them on the Web. I welcome neither copyright snitches that work against remix culture nor GenAI scrapers that devalue and replace creative labor (remix or not) on seirdy.one.

How I block bots

Here's how I block bots, from the most polite approaches to the most severe restrictions.

X-Robots

X-Robots tags control what to do with a page after it has been crawled. They do not opt a page out of mere crawling: they opt it out of processing after a crawler has discovered it.

I set X-Robots tags on every page that forbid training Generative AI algorithms but permit just about everything else. Google Search Central lists its supported directives.

DeviantArt popularized the NoAI X-Robots tag, which Cohost and Misskey have since implemented. The img2dataset scraper respects it.
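
As an illustration, here's a minimal sketch of how a site could emit such a tag as an HTTP response header in Nginx. The directive values are an assumption for illustration, not my exact production configuration:

# Forbid training GenAI on the served pages while permitting other processing.
# "always" ensures the header is also sent on error responses.
add_header X-Robots-Tag "noai, noimageai" always;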

robots.txt

robots.txt is meant to opt out of crawling, to reduce server load. It does not opt you out of further processing of crawled pages. Data miners can still fetch your pages without crawling them: they can fetch archived snapshots, use data collection in users' browsers or browser extensions, download or buy datasets, etc. X-Robots tags are the only standard vendor-neutral format for opting out of processing of crawled pages.

There is no single way to opt out of both crawling and further processing at once: a page's policies on permitted forms of processing can only be discovered by fetching the page and reading those policies in its X-Robots tags.

That said, many robots don't respect X-Robots tags, and for some kinds of processing I want to prevent (e.g. IP-snitching), no X-Robots directive exists. Often, a robots.txt entry is the only documented way to opt out of something. In these cases, I compromise and hijack my robots.txt to opt out.
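
For example, here's what such a robots.txt opt-out entry looks like, using GPTBot (covered later on this page) for illustration:

# Opt out of OpenAI's GenAI training-data scraper.
User-Agent: GPTBot
Disallow: /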

Comparing robots.txt, X-Robots, and the TDM Reservation Protocol

Ideally, X-Robots would support per-engine rules in a standard format, as robots.txt does. Unfortunately, it doesn't. Unlike robots.txt rules, X-Robots rules are per-page, and they persist in archives of a given page that may be saved into other data sets.

A successor to X-Robots called the Text and Data Mining (TDM) Reservation Protocol may close the gap between X-Robots and robots.txt. It allows setting policies on data-mining to forbid LLM training, among other things.

The noai X-Robots directive makes the same mistake as the "Do Not Track" HTTP request header: it's a vague polite request with no incentive for compliance. Crawlers generally respect robots.txt because they get caught and banned if they don't; indexers respect noindex because it keeps duplicate, internal, and shallow pages out of their indexes. There's no such incentive to respect noai.

The TDM Reservation Protocol is less like "Do Not Track" and more like Global Privacy Control: observance of the TDM Reservation Protocol could be a way to qualify for an exception under the European Union's "CDSM Directive". It could have a clear legal meaning and legal incentive, which gives companies a very good reason to respect it. After the TDM Reservation Protocol finalizes and is shown to work, I may adopt it.
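
For reference, here's a minimal sketch of what observing the protocol could look like, expressed as Nginx directives. The header names come from the TDM Reservation Protocol draft as I understand it; the policy URL is a hypothetical placeholder:

# Reserve text-and-data-mining rights and point to a machine-readable policy.
add_header TDM-Reservation "1" always;
add_header TDM-Policy "https://example.com/tdm-policy.json" always;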

User-Agent regex denial

I block certain bots by matching their User-Agent strings against a regular expression and serving them a "403: Forbidden" HTTP error. I don't want to welcome these robots on seirdy.one. They typically don't obey (or have a history of disobeying) X-Robots or robots.txt. This is a last resort. I need well-behaved robots to read my X-Robots tags and robots.txt so they can know that I'm opting out of data collection. By doing so, they can also skip my site when scraping third-party datasets and drop existing copies of my pages from their datasets.

Currently-relevant bots I block in robots.txt

Here's the list of bots I block in robots.txt, with their product tokens and related information.

A product token populates the User-Agent line in a robots.txt entry. It doesn't necessarily map to the robots' User-Agent HTTP request headers, but it should have a strong resemblance.

The product token SHOULD be a substring of the identification string that the crawler sends to the service. For example, in the case of HTTP RFC9110, the product token SHOULD be a substring in the User-Agent header.

— RFC 9309: Robots Exclusion Protocol (https://www.rfc-editor.org/rfc/rfc9309.html)
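
For instance, a hypothetical crawler sending the User-Agent header Mozilla/5.0 (compatible; ExampleBot/2.1; +https://example.com/bot) would be addressed in robots.txt by the stable substring ExampleBot alone:

# The product token omits version numbers and URLs from the User-Agent string.
User-Agent: ExampleBot
Disallow: /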

peer39 crawler

Product tokens
peer39_crawler
peer39_crawler/1.0
Reason
Adtech
Information
peer39 crawler information page
Relevant quote: By allowing us access, you enable the maximum number of advertisers to confidently purchase advertising space on your pages. Our comprehensive data insights help advertisers understand the suitability and context of your content, ensuring that their ads align with your audience's interests and needs. This alignment leads to improved user experiences, increased engagement, and ultimately, higher revenue potential for your publication.

TurnItInBot

Product tokens
turnitinbot
Reason
Intellectual property snitch.
Information
TurnItInBot information page
Relevant quote: This robot collects content from the Internet for the sole purpose of helping educational institutions prevent plagiarism. In particular, we compare student papers against the content we find on the Internet to see if we can find similarities.
Commentary
TurnItIn also offers false-positive-ridden LLM-detection services. This forces students with original papers to adjust their writing styles, distancing themselves from LLM stylometric fingerprints. Non-native English speakers find this especially hard.

BLEXBot

Product tokens
BLEXBot
Reason
An intersection of adtech and search engine optimization. The mention of detecting legal issues on its information page was also concerning.
Information
BLEXBot information page (unencrypted, no TLS)
Archive of BLEXBot information page (with TLS)
Relevant quote: BLEXBot assists internet marketers to get information on the link structure of sites and their interlinking on the web, to avoid any technical and possible legal issues and improve overall online experience.
Commentary
This one doesn't seem too severe compared to the others; I might have overreacted.

CheckMark Network's crawler

Product tokens
CheckMarkNetwork/1.0 (+https://www.checkmarknetwork.com/spider.html)
Reason
Intellectual property snitch
Information
CheckMark Network homepage
Relevant homepage quotes: "We provide comprehensive trademark monitoring and specialized watch packages in all jurisdictions." and "The Internet is just way too big to effectively police alone."
Commentary
The product token is awful. It should just be CheckMarkNetwork, a subset of the User-Agent string excluding variable version indicators and URLs.

BrandVerity

Product tokens
BrandVerity/1.0
Reason
Intellectual property snitch
Information
BrandVerity crawler information page
BrandVerity homepage
Relevant homepage quote: Maximize CTRs and minimize CPCs by identifying and removing harmful ads from appearing on your most popular branded keywords. Protect your affiliate channel from abuse by policing and taking down unauthorized ads.

PiplBot

Product tokens
PiplBot
Reason
Invasive background checks
Information
PiplBot information page
Pipl homepage
Relevant quote: Pipl assembles online identity information from multiple independent sources to create the most complete picture of a digital identity and connect it to real people and their offline identity records. When all the fragments of online identity data are collected, connected, and corroborated, the result is a more trustworthy identity.

OpenAI GenAI training-data scrapers

Product tokens
GPTBot
Reason
GenAI training-data scraper
Information
OpenAI bot documentation
Relevant quote: GPTBot is used to make our generative AI foundation models more useful and safe. It is used to crawl content that may be used in training our generative AI foundation models. Disallowing GPTBot indicates a site's content should not be used in training generative AI foundation models.
Commentary
The other two bots OpenAI lists, OAI-SearchBot and ChatGPT-User, aren't used for training LLMs. If you don't want your work to train an OpenAI LLM, blocking them won't accomplish that; it could, however, serve as a form of protest against the company.

Google-Extended

Product tokens
Google-Extended
Reason
GenAI training-data opt-out signal, not an actual bot
Information
Google-Extended documentation
Relevant quote: Google-Extended is a standalone product token that web publishers can use to manage whether their sites help improve Gemini Apps and Vertex AI generative APIs, including future generations of models that power those products. Grounding with Google Search on Vertex AI does not use web pages for grounding that have disallowed Google-Extended. Google-Extended does not impact a site's inclusion or ranking in Google Search.

Applebot-Extended

Product tokens
Applebot-Extended
Reason
GenAI training-data opt-out signal, not an actual bot
Information
Applebot-Extended documentation
Relevant quote: With Applebot-Extended, web publishers can choose to opt out of their website content being used to train Apple's foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.

Anthropic's ClaudeBot

Product tokens
ClaudeBot
Reason
GenAI training-data scraper
Information
Anthropic crawler documentation
Commentary
Anthropic was extremely late in posting robots.txt guidance, long after it had already scraped plenty of data. It does seem well-behaved now.

Meta GenAI training-data scrapers

Product tokens
FacebookBot
Meta-ExternalAgent
Reason
GenAI training-data scraper.
Extreme lack of trust in Meta/Facebook.
Information
Meta-ExternalAgent information page
FacebookBot information page
Relevant quotes: "The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly." and "FacebookBot crawls public web pages to improve language models for our speech recognition technology."
Commentary
FacebookBot used to be the opt-out signal for GenAI data scraping. After hundreds of sites opted out, Meta quietly introduced Meta-ExternalAgent as its new GenAI data-scraping product token and dropped GenAI from its description of FacebookBot. This effectively required everyone to learn how to opt out all over again; in the meantime, Meta scraped plenty of sites that had previously opted out. I block both product tokens for good measure.

Cotoyogi

Product tokens
Cotoyogi
Reason
GenAI training-data scraper
Information
Bot info page
Parent page indicates it builds LLMs from the data

Webz.io

Product tokens
Webzio-extended
Reason
GenAI training-data opt-out signal, not an actual bot
Information
Webz.io crawler information page
Relevant quote: This bot takes things a step further. It analyzes your content to see if you've indicated it's forbidden for AI usage, letting big data applications know if the data is allowed for AI purposes. This indicator will be clearly marked on the data itself and reflected in our Terms of Service.

Kangaroo Bot

Product tokens
Kangaroo Bot
Reason
GenAI training-data scraper
Information
Kangaroo Bot information page

Open Web Search GenAI-targeted opt-out

Product tokens
GenAI
Reason
GenAI training-data opt-out signal, not an actual bot
Information
Open Web Search crawler information page
Relevant quote: we also support the user-agent identifier GenAI, representing any data use for the purposes of training generative AI models […] OpenWebSearch.EU forwards any information about the publishers' usage preferences to the users of our web index and all additional data products we publish through an INDEX as well as a GENAI Metadata field, both represented as boolean values.
Commentary
I'm actually a fan of the Open Web Search project. It's important to have open indexes of the Web, such as Common Crawl and Open Web Search, for projects like upstart search engines and for research. This signal is visible to all users of the Open Web Search project's index, indicating that I don't want to train GenAI. I very much want Owler to visit my site!

Certain Semrush product tokens

Product tokens
SemrushBot-OCOB
SemrushBot-FT
Reason
Intellectual property snitch.
LLM-based slop generation tool I do not wish to support.
Information
Semrush bot documentation
Relevant quotes: "block SemrushBot-OCOB from crawling your site for ContentShake AI tool" and "block SemrushBot-FT from crawling your site for Plagiarism Checker and similar tools"

Velen

Product tokens
VelenPublicWebCrawler
Reason
GenAI training-data scraper
Information
Velen crawler information page
Relevant quote: Our goal with this crawler is to build business datasets and machine learning models to better understand the web.

Historical bots I might not block

It's harmless to block these, but I might stop listing some of them lest others cargo-cult obsolete entries.

NPBot

Product tokens
NPBot
Reason
Intellectual property snitch.
Information
Archived NameProtect bot info page

SlySearch/iThenticate

Product tokens
SlySearch
Reason
A predecessor of TurnItInBot.
Information
Archived copy of SlySearch homepage.

Bots I ban with User-Agent rules

These robots either don't respect robots.txt, have a history of disrespecting it, or have a sufficiently bad reputation. I withheld some entries that I block for more personal reasons. I use case-insensitive regular expression rules to ban their traffic on the server/reverse-proxy level.

Be aware that version numbers in User-Agent strings may change.


In my Nginx configs, I set the $blocked_agent variable on bad user agents. An if directive checking $blocked_agent returns a 403 HTTP error; since the block contains nothing but a return, this is one of the safe uses of Nginx's if. Here's a tiny sample of how I set $blocked_agent in the http context:


map $http_user_agent $blocked_agent {
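  # "~*" marks a case-insensitive regex; patterns match anywhere in the User-Agent.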
  default                           0;
  ~*(?:ai[0-9]bot|allenai\.org)     1;
  ~*(?:chatglm.spider|chatglm\.cn)  1;
}

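The corresponding if directive lives in each server context. A minimal sketch, assuming the map above (the rest of the server block is elided):

server {
    # listen, server_name, and other directives elided.

    # Safe use of "if": the block contains nothing but a return.
    if ($blocked_agent) {
        return 403;
    }
}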

ChatGLM

User-Agent strings
Mozilla/5.0 (compatible; ChatGLM-Spider/1.0; +https://chatglm.cn/)
Information
Official ChatGLM website
Regex filter
chatglm.spider|chatglm\.cn
Commentary
It only hit my site twice. There's no documentation on its robots.txt compliance.

Ai2Bot

User-Agent strings
Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler)
Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)
Information
Ai2 crawler notice: https://www.allenai.org/crawler
Relevant quotes: "The AI2 Bot explores certain domains to find web content. This web content is used to train open language models." and "This user agent string can be used to filter or reject traffic from our crawler if desired."
Regex filter
ai[0-9]bot|allenai\.org
Commentary
This robot used a slightly different User-Agent string to reach my site than the one advertised on the crawling notice (I saw a -Dolma suffix in my logs).

FriendlyCrawler

User-Agent strings
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/605.1.15 (KHTML, like Gecko; compatible; FriendlyCrawler/1.0) Chrome/120.0.6099.216 Safari/605.1.15
Information
A third-party writeup has good coverage of this bot.

PerplexityBot

User-Agent strings
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://docs.perplexity.ai/docs/perplexity-bot)
Product tokens
PerplexityBot
Information
PerplexityBot documentation: https://docs.perplexity.ai/docs/perplexity-bot
Commentary
A developer's investigation caught Perplexity scraping pages it had been blocked from fetching, disobeying robots.txt in the process too. WIRED confirmed these findings in its own reporting.

Perplexity doesn't train new LLMs, but it has lost enough of my trust.

Exceptions: scrapers I allow, despite meeting some block-criteria

AdsBot

Advertising-averse webmasters often block Google's AdsBot and its cousins. I allow it, but place the following well-known resources in my site's root:

  • /ads.txt
  • /app-ads.txt (possibly redundant)

These files contain the list of authorized ad sellers for a domain. I use a standard placeholder to declare that there exist no authorized ad sellers for seirdy.one. Any client-side malware that injects ads into my site won't make money via ad networks adhering to this specification.


These are the contents of my /ads.txt and /app-ads.txt.

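# Field meanings, per the IAB spec: ad-system domain, seller account ID,
# relationship (DIRECT or RESELLER), and an optional certification authority ID.
# This placeholder record declares that no authorized ad sellers exist.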

placeholder.example.com, placeholder, DIRECT, placeholder


You can read more in the ads.txt specification by the IAB Tech Lab. I learned about this technique via an article from GrapheneOS describing their use of /ads.txt.

CCBot and Owler

I allow CCBot, the robot powering Common Crawl. Common Crawl crawls the web to build a public index that anybody can use. People often use it for research; the Web Data Commons is my favorite example. People also use it to build new search engines, since building an index is one of the hardest parts of starting one: Alexandria and initial versions of Stract use the Common Crawl as their index (Stract has since switched to its own crawler).

Ever since Google used Common Crawl to power initial versions of Bard's LLM (later renamed to Gemini, Google's brand name shared across all its generative-AI products), many sites started to block CCBot in their robots.txt file. I understand why: an open index for anybody to use also invites use for LLM training, so sites blocking LLM training-data scrapers would want to block CCBot. I personally find the non-GenAI utility of Common Crawl too large to ignore.

The Open Web Search Project shares similar goals with Common Crawl, so I allow their "Owler" crawler for similar reasons. I block their GenAI product token only to add metadata to their data set indicating to users that I'm opting out of GenAI training.

Some SEO crawlers: MJ12bot, AhrefsBot, etc.

I frown on most Search Engine Optimization (SEO) services, given how they've contributed to turning the most reachable Web pages into "content" that optimizes for ranking well on search algorithms at the cost of genuine, useful writing that shows a love for its topic. I call this phenomenon the "Living Dead Internet Theory". I still permit most SEO crawlers, though.

Many of them have other uses. MJ12bot powers Majestic's web index, which Majestic uses for historical exploration, a (now offline?) search engine, and The Majestic Million. The Majestic Million also partially powers Tranco, which in turn supports a large body of research. Ahrefs is another SEO company that uses AhrefsBot to find SEO backlinks, but also started using its index for the Yep.com search engine.

For the ones that don't have other uses, I don't have an issue with backlink-finding. I have issues with other SEO tools, such as keyword research and analytics to optimize "content" for engagement rather than to write about what an author feels like sharing.2 My understanding is that crawling mostly powers backlink-finding.

LLM clients

GenAI companies use these bots to power LLM-based tools. They don't train GenAI but function as clients of LLMs to e.g. fetch search results. These include:

  • cohere-ai
  • OpenAI's OAI-SearchBot, ChatGPT-User
  • You.com's YouBot (I'm not sure about this one)

To-do

I need to investigate Diffbot. It fetches data for a commercial dataset useful for LLM training; it also powers article extraction for services such as Instapaper. I need to weigh the range of its uses as I did for CCBot before I make a decision.

I should invest more time thinking about bots that don't fit my current criteria but aren't well-behaved (too many requests, not obeying robot directives such as X-Robots or robots.txt). Even if my site can weather their load, I should still use this page to alert others about them if possible.

Conclusion

I hope this helps more webmasters learn not just about certain bots worth blocking, but also about the nuance involved. I split things across robots.txt, X-Robots, and banned User-Agent strings in my server configs. I allow certain often-blocked bots to slip through.

Acknowledgements and other resources

I found most bots via my server logs, but some third-party collections proved helpful.

Dark Visitors
I don't recommend using this as an authoritative source: it's not terribly detailed, invests little nuance in each entry's explanation, and contains several inaccurate entries and false positives. Still, Dark Visitors is a useful resource for anybody willing to fact-check what they find.
ai.robots.txt
Like Dark Visitors, this is a repository of information about GenAI-related robots with some inaccuracies and false-positives. Fact-check any findings before accepting them.
Matomo's Device Detector repository
Browsing Device Detector's bots.yml using yq helped me learn about a few entries.
Cloudflare Verified Bots
I don't recall actually discovering a bot to block via Cloudflare Radar, but I'm linking it here as it's a helpful bot catalog in general. Be aware of errors in its bot categorization: several bots listed under "AI Crawler" aren't actually training GenAI or even related to it, and some "Search Engine Crawler" entries are non-crawling scrapers that only fetch favicons for search results. This is a good starting point for further research if you fact-check entries yourself.
Helpful individuals
Thanks to the person who pointed me to Open Web Search's addition of a GenAI product token for opting out of GenAI training.

  1. I also got some secondhand embarrassment from some sites that copied my robots.txt file verbatim, even preserving my sitemap URL at the end!

  2. I know "not all analytics" optimize engagement. Not all analytics are problematic for the same reason; different kinds of analytics are problematic for different reasons.