mirror of https://git.sr.ht/~seirdy/seirdy.one synced 2024-11-23 12:52:10 +00:00

syndicate

Seirdy 2024-07-24 14:33:30 -04:00
parent 03d504e79a
commit 8443cb781d
No known key found for this signature in database
GPG key ID: 1E892DB2A5F84479


@@ -6,9 +6,9 @@ replyTitle: "We really fucking need a community owned search engine index that f
 replyType: "SocialMediaPosting"
 replyAuthor: "Marta Sokolska"
 replyAuthorURI: "https://me.pionaiki.com/"
-#syndicatedCopies:
-#  - title: 'The Fediverse'
-#    url: ''
+syndicatedCopies:
+  - title: 'The Fediverse'
+    url: 'https://pleroma.envs.net/objects/48afb46f-0c40-46cb-8e48-867032d693aa'
 ---
 Common Crawl is the closest thing we have to an open index, though it doesn't meet your requirement of ignoring robots.txt for corporate websites while obeying it for personal sites. Unfortunately, being open and publicly available means that people use it to train LLMs. ([Google did this for initial versions of Bard]({{<relref "/notes/opting-out-of-llm-indexing.md">}})), so a lot of sites block its crawler. Most robots.txt guides for blocking GenAI crawlers include an entry for it now.