diff --git a/content/meta/scrapers-i-block.md b/content/meta/scrapers-i-block.md
index d4e98e1..fc97e6b 100644
--- a/content/meta/scrapers-i-block.md
+++ b/content/meta/scrapers-i-block.md
@@ -61,6 +61,13 @@ I set `X-Robots` tags in every page that forbid training Generative AI algorithm
 {{}}DeviantArt popularized the `NoAI` `X-Robots` tag in {{}}{{}}, which [Cohost](https://web.archive.org/web/20241207040446/https://cohost.org/staff/post/272195-cohost-now-sets-devi) and [Misskey](https://github.com/misskey-dev/misskey/pull/10833) have since implemented. The [img2dataset scraper](https://github.com/rom1504/img2dataset/pull/218) respects it.
 
+In September 2024, Bing announced support for a `nocache` robots directive and hijacked the existing `noarchive` directive.
+
+- `nocache` allows Microsoft to train LLMs only on search-result titles and snippets, and preserves visibility in Bing Chat.
+- `noarchive` opts a site out of Bing Chat and Microsoft's LLM training entirely.
+
+I adopted `nocache`, as I still want my site to support real archiving services.
+
 ### robots.txt
 
 robots.txt is meant to opt out of crawling, to reduce server load. It does _not_ opt you out of further processing of crawled pages. Data miners can still fetch your pages without crawling them: they can fetch archived snapshots, use data collection in users' browsers or browser extensions, download or buy datasets, etc. `X-Robots` tags are the only standard vendor-neutral format for opting out of processing of crawled pages.
diff --git a/layouts/partials/head.html b/layouts/partials/head.html
index 6bd2f85..4e5237f 100644
--- a/layouts/partials/head.html
+++ b/layouts/partials/head.html
@@ -16,9 +16,9 @@
 {{ if or (eq (trim site.BaseURL "/") site.Params.CanonicalBaseURL) (in site.BaseURL "wgq3bd2kqoybhstp77i3wrzbfnsyd27wt34psaja4grqiezqircorkyd.onion") -}}
-
+
 {{ else -}}
-
+
 {{ end -}}
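To make the opt-out mechanism concrete: a well-behaved scraper would parse the page's robots directives (from the `X-Robots-Tag` HTTP header or the `robots` meta tag) and skip AI-training uses when an opt-out directive is present. The sketch below is illustrative only; the function names and the exact directive list are assumptions, not part of the diff above:

```python
def parse_robots_directives(value: str) -> set[str]:
    """Split a robots header/meta value like 'noai, nocache' into
    a set of lowercase directives."""
    return {d.strip().lower() for d in value.split(",") if d.strip()}

# Hypothetical policy: directives that a compliant scraper would
# treat as an opt-out from generative-AI training.
AI_OPT_OUT = {"noai", "noimageai", "noarchive"}

def may_use_for_training(value: str) -> bool:
    """Return True only if no AI-training opt-out directive is set."""
    return not (AI_OPT_OUT & parse_robots_directives(value))
```

A page served with `X-Robots-Tag: noai, nocache` would thus be excluded from training while remaining cacheable by real archiving services.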