mirror of https://git.sr.ht/~seirdy/seirdy.one synced 2024-11-23 21:02:09 +00:00

Compare commits

...

2 commits

Author | SHA1 | Message | Date
Rohan Kumar | 38e7dc36bc | Clarify FediNuke criteria | 2023-07-08 13:43:13 -07:00
Rohan Kumar | 91d679e0a3 | New note: blocking certain bots | 2023-07-06 10:37:21 -07:00
3 changed files with 34 additions and 2 deletions

@@ -0,0 +1,24 @@
---
title: "Blocking certain bots"
date: 2023-07-06T10:36:32-07:00
replyURI: "https://gist.github.com/dangovorenefekt/b187b30e59ed1b827515cdbc833bc1bf?permalink_comment_id=4611496#gistcomment-4611496"
replyTitle: "You miss the point"
replyType: "DiscussionForumPosting"
replyAuthor: "DanGovorenEfekt"
replyAuthorURI: "https://govorenefekt.com/"
syndicatedCopies:
  - title: 'GitHub'
    url: 'https://gist.github.com/dangovorenefekt/b187b30e59ed1b827515cdbc833bc1bf?permalink_comment_id=4621262#gistcomment-4621262'
  - title: 'The Fediverse'
    url: 'https://pleroma.envs.net/notice/AXQOGKrOajCIX9XgEC'
---
> I don't want my content on those sites in any form and I don't want my content to feed their algorithms. Using robot.txt assumes they will 'obey' it. But they may choose not to.

So you'd like to block the scrapers used by some large companies, most of which appear well-behaved (i.e. they observe `robots.txt` and `noindex` directives). Some organizations disclose the IPs they use, but you are banking on their honesty.

If we assume that trillion-dollar organizations are dishonest about how they scrape: they can spoof their user-agent, TLS fingerprint, IPs, etc. and use a headless browser very easily. There isn't really a way to protect yourself from this without also excluding real users (e.g. invasive/inaccessible CAPTCHAs for which workarounds exist).

They can also get your content without scraping by downloading other data sets like the Common Crawl (Google did this for Bard), purchasing data sets from other vendors, or acquiring other companies with their own indexes.

The alternative is to assume they're at least somewhat honest about scraping content. If you use a `noindex` robots directive in your markup and HTTP headers but allow crawling, their crawlers will visit but won't index your site no matter what user-agent or data set they use. Consult their webmaster documentation to confirm that they support these features.
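
To make the `noindex` approach concrete, here is a minimal sketch of how one might verify that a page carries the directive in both places: a `<meta name="robots">` tag in the markup and an `X-Robots-Tag` HTTP header. The names `RobotsMetaParser` and `has_noindex` are hypothetical helpers invented for illustration, and the check is deliberately simplified: it ignores crawler-specific forms such as `<meta name="googlebot">` or an `X-Robots-Tag` value prefixed with a user-agent.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (e.g. a bare attribute), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html, headers):
    """Return (in_markup, in_headers) for the generic `noindex` directive.

    `headers` is assumed to be a plain dict of response headers.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    in_markup = "noindex" in parser.directives

    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value
    header_directives = {d.strip().lower() for d in header_value.split(",")}
    in_headers = "noindex" in header_directives

    return in_markup, in_headers


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(has_noindex(page, {"X-Robots-Tag": "noindex"}))
```

A check like this only confirms that the directive is being sent; whether a given crawler honors it still depends on that crawler's documented behavior.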

@@ -17,7 +17,11 @@ A much smaller semi-curated subset of pleroma.envs.net suitable for the majority
=> https://seirdy.one/pb/FediNuke.txt FediNuke.txt
A curated subset of tier0.csv, containing what I deem the "worse half" of it. This contains instances I really do recommend most people block, or at least avoid. I try to make it a suitable candidate for a "default blocklist", and use it as reference when I evaluate the quality of other blocklists. This list is not comprehensive; to keep this list small, I excluded many really bad instances. I take into account not just severity, but also notoriety and likelihood of reaching/harming people on other instances (e.g. spewing toxicity in others' mentions, running blockbots, etc). Bad instances that mostly keep to themselves are less likely to cause problems for a new admin, and therefore less likely to get included in this minimal list.
A curated subset of tier0.csv, containing what I deem the "worse half" of it. This contains instances I really do recommend most people block, or at least avoid. I try to make it a suitable candidate for a "default blocklist", and use it as reference when I evaluate the quality of other blocklists.
This list is not comprehensive; to keep this list small, I excluded many really bad instances. I take into account not just severity, but also notoriety and likelihood of reaching/harming people on other instances (e.g. spewing toxicity in others' mentions, running blockbots, etc). Bad instances that mostly keep to themselves are less likely to cause problems for a new admin, and therefore less likely to get included in this minimal list.
Criteria are not set in stone. Instances well-known for causing significant problems for many other instances, particularly for instances run by and for marginalized groups, may be added.
Some of these lists are also sources for the Oliphant blocklists:
=> https://writer.oliphant.social/oliphant/the-blocklist-algorithm The Blocklist Algorithm - The Oliphant

@@ -21,7 +21,11 @@ I maintain three blocklists for the Fediverse:
: A much smaller semi-curated subset of `pleroma.envs.net` suitable for the majority of instances wishing to uphold the Mastodon Covenant's moderation standards. I hope to make it a good starting point for your instance's blocklist, with wiggle room for your own adjustments. Regularly importing `tier0.csv` won't account for retractions; [a sibling blocklist for tier-0 retractions](https://seirdy.one/pb/tier0-retractions.csv) exists for [FediBlockHole](https://github.com/eigenmagic/fediblockhole) users. Note that this list is larger than the bare minimum I recommend; that'd be FediNuke.
[`FediNuke.txt`](https://seirdy.one/pb/FediNuke.txt)
: A curated subset of `tier0.csv`, containing what I deem the "worse half" of it. This contains instances I really do recommend most people block, or at least avoid. I try to make it a suitable candidate for a "default blocklist", and use it as reference when I evaluate the quality of other blocklists. This list is not comprehensive; to keep this list small, I excluded many really bad instances. I take into account not just severity, but also notoriety and likelihood of reaching/harming people on other instances (e.g. spewing toxicity in others' mentions, running blockbots, etc). Bad instances that mostly keep to themselves are less likely to cause problems for a new admin, and therefore less likely to get included in this minimal list.
: A curated subset of `tier0.csv`, containing what I deem the "worse half" of it. This contains instances I really do recommend most people block, or at least avoid. I try to make it a suitable candidate for a "default blocklist", and use it as reference when I evaluate the quality of other blocklists.
This list is not comprehensive; to keep this list small, I excluded many really bad instances. I take into account not just severity, but also notoriety and likelihood of reaching/harming people on other instances (e.g. spewing toxicity in others' mentions, running blockbots, etc). Bad instances that mostly keep to themselves are less likely to cause problems for a new admin, and therefore less likely to get included in this minimal list.
Criteria are not set in stone. Instances well-known for causing significant problems for many other instances, particularly for instances run by and for marginalized groups, may be added.
Some of these lists are also sources for the Oliphant blocklists. {{<mention-work itemtype="BlogPosting">}}{{<indieweb-person name="Oliphant" url="https://oliphant.social/@Oliphant" itemprop="author">}} describes them in his article {{<cited-work name="The Blocklist Algorithm" extraName="headline" url="https://writer.oliphant.social/oliphant/the-blocklist-algorithm">}}.{{</mention-work>}}