syndicate

New note: opting out of LLM indexing
syndicate
2024-12-17 22:32:10 +00:00 · 2023-04-21 22:45:43 -07:00 · 2023-04-21 22:40:04 -07:00 · 2023-04-20 14:41:17 -07:00 · 2023-04-20 14:36:16 -07:00
2 changed files with 40 additions and 0 deletions
--- a/content/notes/opting-out-of-llm-indexing.md
+++ b/content/notes/opting-out-of-llm-indexing.md
@@ -0,0 +1,17 @@
 ---
 title: "Opting out of LLM indexing"
 date: 2023-04-21T22:40:04-07:00
 replyURI: "https://chriscoyier.net/2023/04/21/the-secret-list-of-websites/"
 replyTitle: "“the secret list of websites”"
 replyType: "BlogPosting"
 replyAuthor: "Chris Coyier"
 replyAuthorURI: "https://chriscoyier.net/"
 syndicatedCopies:
    - title: 'The Fediverse'
      url: 'https://pleroma.envs.net/notice/AUttq9kpOmeYZDHRTc'
 ---
 I added an entry to [my robots.txt](https://seirdy.one/robots.txt) to block ChatGPT's crawler, but blocking crawling isn't the same as blocking indexing; it looks like Google chose to use the [Common Crawl](https://commoncrawl.org/) for this, sidestepping the need to crawl on its own. That's a strange decision; after all, Google has a much larger proprietary index at its disposal.
 A "secret list of websites" was an ironic choice of words, given that this originates from the Common Crawl. It's sad to see Common Crawl (ab)used for this, but I suppose we should have seen it coming.
 I know Google tells authors how to qualify/disqualify from rich results, but I don't see any docs for opting a site out of LLM/Bard training.
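A robots.txt entry like the one mentioned above can be checked locally before it is deployed. The following is a minimal sketch using Python's standard `urllib.robotparser`; the `ChatGPT-User` token, the sample rules, and the `example.com` URLs are assumptions for illustration, not a copy of the linked robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only (an assumption, not the real file):
# disallow one crawler by user-agent token, allow everyone else.
ROBOTS_TXT = """\
User-agent: ChatGPT-User
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ChatGPT-User", "SomeOtherBot"):
        blocked = is_blocked(ROBOTS_TXT, agent, "https://example.com/notes/")
        print(agent, "blocked" if blocked else "allowed")
```

This only tests what a compliant crawler would be permitted to fetch. It says nothing about whether pages already captured in a third-party dataset end up in a training corpus, which is the gap the note describes.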
--- a/content/notes/re-automated-workflows-for-websites.md
+++ b/content/notes/re-automated-workflows-for-websites.md
@@ -0,0 +1,23 @@
 ---
 title: "Re: automated workflows for websites"
 date: 2023-04-20T14:36:16-07:00
 replyURI: "https://blog.lea.lgbt/posts/2023-04-20-automated-workflows-for-websites/"
 replyTitle: "Automated workflows for websites"
 replyType: "BlogPosting"
 replyAuthor: "Lea Rosema"
 replyAuthorURI: "https://blog.lea.lgbt/"
 syndicatedCopies:
    - title: 'The Fediverse'
      url: 'https://pleroma.envs.net/notice/AUr8PE6SK6jXl3XaE4'
 ---
 This is so similar to [my setup]({{<relref "/meta/_index.md">}})! I run Stylelint and v.Nu too. I [send v.Nu output through a jq filter](https://git.sr.ht/~seirdy/seirdy.one/tree/master/item/linter-configs/vnu_filter.jq) to filter out false positives (after reporting them upstream); you might eventually do something similar, since there are a _lot_ of these. Your blog post reminds me that I need something better than regex substitutions for customizing footnote and section links; Hugo's parallel nature prevents it from doing post-processing of fully-assembled pages. Other tools I use:
 - `xmllint` to check that the markup is well-formed XHTML5; it runs much faster than v.Nu and does light auto-formatting, but is also more limited.
 - There's also [a W3C feed validator](https://github.com/w3c/feedvalidator) written in Python worth checking out; I send my Atom feeds through that.
 - I run `axe-core`, the IBM Equal Access checker, and Webhint on every page with headless browsers.
 - In the future: I'll need to figure out a good workflow for easily validating JSON according to a schema, and adding some Microformats + Microdata validation too (maybe using Schemarama?).
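
The well-formedness check in the list above can be illustrated with a minimal Python sketch. It uses the standard library's XML parser as a stand-in and is not how `xmllint` works internally; the function name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    good = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>ok</p></body></html>'
    bad = "<p>unclosed<br></p>"  # <br> is never closed, so this is not XML
    print(is_well_formed(good), is_well_formed(bad))
```

A check like this is fast because it only verifies syntax. It does not know which elements or attributes are valid HTML, which is the part a full validator such as v.Nu covers. Because no DTD is loaded, this sketch would also reject named entities such as `&nbsp;`.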
 The whole thing takes several minutes to run, so I don't run it on every commit; just building my site (no linting or validation) requires only a tarball with some statically linked binaries. It's more in line with the ["built to last"](https://jeffhuang.com/designed_to_last/) philosophy; I'm curious if you have any thoughts about it.
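
The false-positive filtering step mentioned at the start of this note can be sketched in Python rather than jq. This assumes v.Nu's JSON output format (a top-level `messages` array whose items carry `type` and `message` fields); the suppression pattern and the sample report are hypothetical and do not reproduce the linked `vnu_filter.jq`.

```python
import json
import re

# Hypothetical suppression list: each pattern matches a message that has
# been judged a false positive and reported upstream.
SUPPRESSED = [
    re.compile(r"^Attribute .example-attr. not allowed on element"),
]


def filter_messages(report: dict, suppressed=SUPPRESSED) -> dict:
    """Return a copy of report without messages matching any suppressed pattern."""
    kept = [
        msg
        for msg in report.get("messages", [])
        if not any(p.search(msg.get("message", "")) for p in suppressed)
    ]
    return {**report, "messages": kept}


if __name__ == "__main__":
    # Invented sample standing in for validator output in JSON format.
    sample = {
        "messages": [
            {"type": "error", "message": "Attribute “example-attr” not allowed on element “a” at this point."},
            {"type": "error", "message": "Duplicate ID “intro”."},
        ]
    }
    remaining = filter_messages(sample)
    print(json.dumps(remaining, ensure_ascii=False, indent=2))
    # A CI job could fail the build when any messages remain.
    raise SystemExit(1 if remaining["messages"] else 0)
```

Keeping the suppression list as anchored patterns makes it easy to delete an entry once the upstream fix lands, so a suppressed false positive does not silently mask a real error later.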
Author	SHA1	Message	Date
Rohan Kumar	e4592387a3	syndicate	2023-04-21 22:45:43 -07:00
Rohan Kumar	2a8d60b896	New note: opting out of LLM indexing	2023-04-21 22:40:04 -07:00
Rohan Kumar	fb66d67114	syndicate	2023-04-20 14:41:17 -07:00
Rohan Kumar	2d230e21db	New note: re: automated workflows for websites	2023-04-20 14:36:16 -07:00