mirror of https://git.sr.ht/~seirdy/seirdy.one synced 2024-12-24 17:52:11 +00:00

add nocache robots tag

This commit is contained in:
Seirdy 2024-12-11 10:31:56 -05:00
parent d46e050433
commit b7eaf6ddd9
No known key found for this signature in database
GPG key ID: 1E892DB2A5F84479
2 changed files with 9 additions and 2 deletions


@@ -61,6 +61,13 @@ I set `X-Robots` tags in every page that forbid training Generative AI algorithm
{{<mention-work itemtype="BlogPosting">}}<span itemscope="" itemprop="publisher" itemtype="https://schema.org/Organization">DeviantArt</span> popularized the `NoAI` `X-Robots` tag in {{<cited-work name="UPDATE All Deviations Are Opted Out of AI Datasets" url="https://www.deviantart.com/team/journal/UPDATE-All-Deviations-Are-Opted-Out-of-AI-Datasets-934500371" extraName="headline">}}{{</mention-work>}}, which [Cohost](https://web.archive.org/web/20241207040446/https://cohost.org/staff/post/272195-cohost-now-sets-devi) and [Misskey](https://github.com/misskey-dev/misskey/pull/10833) have since implemented. The [img2dataset scraper](https://github.com/rom1504/img2dataset/pull/218) respects it.
In September 2024, Bing announced support for a `nocache` robots directive and hijacked the existing `noarchive` directive.
- `nocache` restricts Microsoft's <abbr>LLM</abbr> training to search result titles and snippets, and preserves visibility in Bing Chat.
- `noarchive` completely opts a site out of Bing Chat and Microsoft's <abbr>LLM</abbr> training.
I adopted `nocache`, as I still want my site to support real archiving services.
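
As an illustration (not part of this commit): the same directives can also be delivered as an `X-Robots-Tag` HTTP response header instead of a `<meta>` tag, which covers non-HTML resources such as images. A minimal sketch, assuming an nginx server block; the directive list mirrors the meta tag in this commit:

```nginx
# Hypothetical nginx config: send robots directives on every response,
# including images and feeds that can't carry a <meta> tag.
server {
    listen 443 ssl;
    server_name example.com;  # placeholder domain

    add_header X-Robots-Tag "index, follow, max-image-preview:large, max-snippet:-1, noai, noimageai, nocache" always;
}
```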
### <span translate="no">robots.txt</span>
<span translate="no">robots.txt</span> is meant to opt out of crawling, to reduce server load. It does _not_ opt you out of further processing of crawled pages. Data miners can still fetch your pages without crawling them: they can fetch archived snapshots, use data collection in users' browsers or browser extensions, download or buy datasets, etc. `X-Robots` tags are the only standard vendor-neutral format for opting out of processing of crawled pages.
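
For illustration (not from this commit): a robots.txt that asks some AI crawlers not to crawl at all could use user-agent tokens their operators document (e.g. OpenAI's `GPTBot`, Common Crawl's `CCBot`), though per the above it does nothing about data already collected or obtained through other channels:

```
# Sketch of a robots.txt targeting documented AI crawler tokens.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```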


@@ -16,9 +16,9 @@
 <!-- Only index the canonical locations, not the envs.net mirror. -->
 {{ if or (eq (trim site.BaseURL "/") site.Params.CanonicalBaseURL) (in site.BaseURL "wgq3bd2kqoybhstp77i3wrzbfnsyd27wt34psaja4grqiezqircorkyd.onion") -}}
 <!-- See https://noml.info/, https://www.deviantart.com/team/journal/UPDATE-All-Deviations-Are-Opted-Out-of-AI-Datasets-934500371 -->
-<meta name="robots" content="index,follow,max-image-preview:large,max-snippet:-1,noai,noimageai,noml" />
+<meta name="robots" content="index,follow,max-image-preview:large,max-snippet:-1,noai,noimageai,nocache" />
 {{ else -}}
-<meta name="robots" content="noindex,nofollow,noimageindex,noai,noimageai" />
+<meta name="robots" content="noindex,nofollow,noimageindex,noai,noimageai,nocache" />
 {{ end -}}
 <link href="{{ .Site.Params.CanonicalBaseURL }}{{ $canonicalRelPermalink }}" rel="canonical" />
 <link href="{{ .Site.Params.WebmentionEndpoint }}" rel="webmention" />