syndicate

New note: opting out of LLM indexing
2024-11-23 21:02:09 +00:00 · 2023-04-21 22:45:43 -07:00 · 2023-04-21 22:40:04 -07:00 · 2023-04-20 14:41:17 -07:00 · 2023-04-20 14:36:16 -07:00
2 changed files with 40 additions and 0 deletions
--- a/content/notes/opting-out-of-llm-indexing.md
+++ b/content/notes/opting-out-of-llm-indexing.md
@@ -0,0 +1,17 @@
+---
+title: "Opting out of LLM indexing"
+date: 2023-04-21T22:40:04-07:00
+replyURI: "https://chriscoyier.net/2023/04/21/the-secret-list-of-websites/"
+replyTitle: "“the secret list of websites”"
+replyType: "BlogPosting"
+replyAuthor: "Chris Coyier"
+replyAuthorURI: "https://chriscoyier.net/"
+syndicatedCopies:
+    - title: 'The Fediverse'
+      url: 'https://pleroma.envs.net/notice/AUttq9kpOmeYZDHRTc'
+---
+I added an entry to [my robots.txt](https://seirdy.one/robots.txt) to block ChatGPT's crawler, but blocking crawling isn't the same as blocking indexing; it looks like Google chose to use the [Common Crawl](https://commoncrawl.org/) for its LLM training data, sidestepping the need to do any crawling of its own. That's a strange decision; after all, Google has a much larger proprietary index at its disposal.
+
+A "secret list of websites" was an ironic choice of words, given that this originates from the Common Crawl. It's sad to see Common Crawl (ab)used for this, but I suppose we should have seen it coming.
+
+I know Google tells authors how to qualify/disqualify from rich results, but I don't see any docs for opting a site out of LLM/Bard training.
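
Reviewer sketch, not part of the commit: one way to check that a robots.txt entry like the one the note describes really disallows a given crawler. The `ChatGPT-User` token, the sample rules, and the `is_allowed` helper are assumptions for illustration; the live file at seirdy.one/robots.txt may differ. As the note points out, this only covers crawling and says nothing about datasets built from third-party crawls such as Common Crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; the user-agent token is an assumption,
# not copied from the real robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: ChatGPT-User
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ChatGPT-User", "SomeOtherBot"):
        allowed = is_allowed(SAMPLE_ROBOTS_TXT, agent, "https://example.com/notes/")
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

A check like this could sit next to the other linters in a build, so that a typo in a user-agent line is caught before the file is published.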
--- a/content/notes/re-automated-workflows-for-websites.md
+++ b/content/notes/re-automated-workflows-for-websites.md
@@ -0,0 +1,23 @@
+---
+title: "Re: automated workflows for websites"
+date: 2023-04-20T14:36:16-07:00
+replyURI: "https://blog.lea.lgbt/posts/2023-04-20-automated-workflows-for-websites/"
+replyTitle: "Automated workflows for websites"
+replyType: "BlogPosting"
+replyAuthor: "Lea Rosema"
+replyAuthorURI: "https://blog.lea.lgbt/"
+syndicatedCopies:
+    - title: 'The Fediverse'
+      url: 'https://pleroma.envs.net/notice/AUr8PE6SK6jXl3XaE4'
+---
+This is so similar to [my setup]({{<relref "/meta/_index.md">}})! I run Stylelint and v.Nu too. I [send v.Nu output through a JQ filter](https://git.sr.ht/~seirdy/seirdy.one/tree/master/item/linter-configs/vnu_filter.jq) to filter out false positives (after reporting them upstream); you might eventually do something similar, since there are a _lot_ of these. Your blog post reminds me that I need something better than regex substitutions for customizing footnote and section links; Hugo's parallel nature prevents it from doing post-processing of fully-assembled pages. Other tools I use:
+
+- `xmllint` to validate that the markup is well-formed XHTML5 syntax; it runs much more quickly than v.Nu and does light auto-formatting, but is also more limited.
+
+- There's also [a W3C feed validator](https://github.com/w3c/feedvalidator) written in Python worth checking out; I send my Atom feeds through that.
+
+- I run `axe-core`, IBM Equal Access checker, and Webhint on every page with headless browsers.
+
+- In the future: I'll need to figure out a good workflow for easily validating JSON according to a schema, and adding some Microformats + Microdata validation too (maybe using Schemarama?).
+
+The whole thing takes several minutes to run, so I don't run it every commit; just building my site (no linting or validation) requires only a tarball with some statically-linked binaries. It's more in line with the ["built to last"](https://jeffhuang.com/designed_to_last/) philosophy; I'm curious if you have any thoughts about it.
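
Reviewer sketch, not part of the commit: the note filters v.Nu output with a jq script (linked above, not reproduced here). The same idea can be expressed in Python, assuming v.Nu's JSON output is an object with a top-level `messages` array whose entries carry a `message` string. The ignore patterns, the sample report, and the `filter_messages` helper are made up for illustration and are not the author's actual rules.

```python
import json
import re

# Hypothetical patterns for messages treated as known false positives.
IGNORED_PATTERNS = [
    re.compile(r"^Attribute .* not allowed on element .* at this point"),
]


def filter_messages(report: dict, ignored: list) -> dict:
    """Return a copy of the report without messages matching any ignored pattern."""
    kept = [
        msg
        for msg in report.get("messages", [])
        if not any(pattern.search(msg.get("message", "")) for pattern in ignored)
    ]
    return {**report, "messages": kept}


# Made-up sample shaped like a validator report.
SAMPLE_REPORT = {
    "messages": [
        {"type": "error", "message": "Attribute foo not allowed on element div at this point."},
        {"type": "error", "message": "Duplicate ID main."},
    ]
}

if __name__ == "__main__":
    filtered = filter_messages(SAMPLE_REPORT, IGNORED_PATTERNS)
    print(json.dumps(filtered, indent=2))
    # A non-empty result could be used to fail the build.
    raise SystemExit(1 if filtered["messages"] else 0)
```

Keeping the ignore list in one place makes it easy to drop a pattern once the corresponding false positive is fixed upstream.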
Author	SHA1	Message	Date
Rohan Kumar	e4592387a3	syndicate	2023-04-21 22:45:43 -07:00
Rohan Kumar	2a8d60b896	New note: opting out of LLM indexing	2023-04-21 22:40:04 -07:00
Rohan Kumar	fb66d67114	syndicate	2023-04-20 14:41:17 -07:00
Rohan Kumar	2d230e21db	New note: re: automated workflows for websites	2023-04-20 14:36:16 -07:00