seirdy.one/content/notes/common-crawl-and-search-engines.md at 8b0ffb7dc96db70cf0ec369ae657fa559d3f7b62

fmaury/seirdy.one


mirror of https://git.sr.ht/~seirdy/seirdy.one synced 2024-11-14 17:42:09 +00:00

Seirdy 1166a400ca

Syndicate

2024-07-25 01:36:43 -04:00


---
title: "Common Crawl and search engines"
date: 2024-07-24T14:23:35-04:00
replyURI: "https://corteximplant.net/objects/994acaa1-dd50-44b0-bb50-43211521a779"
replyTitle: "We really fucking need a community owned search engine index that fucks the corporate rules (but still respects personal websites' indexing preferences)"
replyType: "SocialMediaPosting"
replyAuthor: "Marta Sokolska"
replyAuthorURI: "https://me.pionaiki.com/"
syndicatedCopies:
    - title: "The Fediverse"
      url: "https://pleroma.envs.net/objects/48afb46f-0c40-46cb-8e48-867032d693aa"
    - title: "The Mojeek Discourse"
      url: "https://community.mojeek.com/t/common-crawl-and-search-engines/1136"
---

Common Crawl is the closest thing we have to an open index, though it doesn't meet your requirement of ignoring robots.txt for corporate websites while obeying it for personal sites. Unfortunately, being open and publicly available means that people use it to train LLMs. [Google did this for initial versions of Bard]({{<relref "/notes/opting-out-of-llm-indexing.md">}}), so a lot of sites block its crawler. Most robots.txt guides for blocking GenAI crawlers include an entry for it now.
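
As a rough illustration of what such a robots.txt entry does, here is a minimal sketch using Python's standard `urllib.robotparser`. `CCBot` is the user-agent token Common Crawl's crawler identifies itself with; the robots.txt contents, `example.com`, and `ExampleSearchBot` are made up for the example and not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Common Crawl's crawler (CCBot),
# allow every other crawler.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and ExampleSearchBot are placeholders for illustration.
ccbot_allowed = is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/notes/")
other_allowed = is_allowed(ROBOTS_TXT, "ExampleSearchBot", "https://example.com/notes/")
print(ccbot_allowed, other_allowed)
```

A site with an entry like this drops out of Common Crawl entirely, which also removes it from any search engine that builds its index on Common Crawl.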

[Common Crawl powers Alexandria Search and was the basis of Stract's initial index]({{<relref "/posts/search-engines-with-own-indexes.md#smaller-indexes-or-less-relevant-results">}}), both of which are upstart FOSS engines.

A similar EU-focused project is OpenWebSearch/Owler.
