mirror of
https://git.sr.ht/~seirdy/seirdy.one
synced 2024-11-23 21:02:09 +00:00
Compare commits
2 commits
dcfb07715c
...
38e7dc36bc
Author | SHA1 | Date | |
---|---|---|---|
 | 38e7dc36bc | | |
 | 91d679e0a3 | | |
3 changed files with 34 additions and 2 deletions
24
content/notes/blocking-certain-bots.md
Normal file
@@ -0,0 +1,24 @@
+---
+title: "Blocking certain bots"
+date: 2023-07-06T10:36:32-07:00
+replyURI: "https://gist.github.com/dangovorenefekt/b187b30e59ed1b827515cdbc833bc1bf?permalink_comment_id=4611496#gistcomment-4611496"
+replyTitle: "You miss the point"
+replyType: "DiscussionForumPosting"
+replyAuthor: "DanGovorenEfekt"
+replyAuthorURI: "https://govorenefekt.com/"
+syndicatedCopies:
+- title: 'GitHub'
+url: 'https://gist.github.com/dangovorenefekt/b187b30e59ed1b827515cdbc833bc1bf?permalink_comment_id=4621262#gistcomment-4621262'
+- title: 'The Fediverse'
+url: 'https://pleroma.envs.net/notice/AXQOGKrOajCIX9XgEC'
+---
+
+> I don't want my content on those sites in any form and I don't want my content to feed their algorithms. Using robot.txt assumes they will 'obey' it. But they may choose not to.
+
+So you'd like to block the scrapers used by some large companies, most of which appear well-behaved (i.e. they observe `robots.txt` and `noindex` directives). Some organizations disclose IPs they use, but you are banking on their honesty.
+
+If we assume that trillion-dollar organizations are dishonest about how they scrape: they can spoof their user-agent, TLS fingerprint, IPs, etc. and use a headless browser very easily. There isn't really a way to protect yourself from this without also excluding real users (e.g. invasive/inaccessible CAPTCHAs for which workarounds exist).
+
+They can get your content without scraping by downloading other data sets like the Common Crawl (Google did this for Bard), purchasing data sets from other vendors, or acquiring other companies with their own indexes.
+
+The alternative is to assume they're at least somewhat honest about scraping content. If you use a `noindex` robots directive in your markup and HTTP headers but allow crawling, their crawlers will visit but won't index your site no matter what user-agent or data set they use. Check their webmaster documentation to double-check their support for these features.
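The last paragraph of the note above relies on a `noindex` directive being present in both the markup and the HTTP headers. As a rough illustration that is not part of these commits, the following Python sketch checks whether a response carries the directive in either place. The names `has_noindex` and `_RobotsMetaParser` are hypothetical, and the sketch makes simplifying assumptions: it ignores user-agent-scoped header values (such as `googlebot: noindex`) and the `none` shorthand.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects the content attribute of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives.append(attr.get("content", ""))


def has_noindex(headers, html):
    """Return True if an X-Robots-Tag header or a robots meta tag says noindex.

    headers: mapping of response header names to values (hypothetical input).
    html: the response body as a string.
    """
    # Header names are case-insensitive, so compare in lowercase.
    values = [v for k, v in headers.items() if k.lower() == "x-robots-tag"]

    parser = _RobotsMetaParser()
    parser.feed(html)
    values.extend(parser.directives)

    # Directives are comma-separated; match the exact token "noindex".
    return any(
        token.strip().lower() == "noindex"
        for value in values
        for token in value.split(",")
    )
```

A check like this only confirms that your own server sends the directive; whether a given crawler honors it still depends on that crawler's documented behavior.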
@@ -17,7 +17,11 @@ A much smaller semi-curated subset of pleroma.envs.net suitable for the majority
 
 => https://seirdy.one/pb/FediNuke.txt FediNuke.txt
 
-A curated subset of tier0.csv, containing what I deem the "worse half" of it. This contains instances I really do recommend most people block, or at least avoid. I try to make it a suitable candidate for a "default blocklist", and use it as reference when I evaluate the quality of other blocklists. This list is not comprehensive; to keep this list small, I excluded many really bad instances. I take into account not just severity, but also notoriety and likelihood of reaching/harming people on other instances (e.g. spewing toxicity in others' mentions, running blockbots, etc). Bad instances that mostly keep to themselves are less likely to cause problems for a new admin, and therefore less likely to get included in this minimal list.
+A curated subset of tier0.csv, containing what I deem the "worse half" of it. This contains instances I really do recommend most people block, or at least avoid. I try to make it a suitable candidate for a "default blocklist", and use it as reference when I evaluate the quality of other blocklists.
+
+This list is not comprehensive; to keep this list small, I excluded many really bad instances. I take into account not just severity, but also notoriety and likelihood of reaching/harming people on other instances (e.g. spewing toxicity in others' mentions, running blockbots, etc). Bad instances that mostly keep to themselves are less likely to cause problems for a new admin, and therefore less likely to get included in this minimal list.
+
+Criteria are not set in stone. Instances well-known for causing significant problems for many other instances, particularly for instances run by and for marginalized groups, may be added.
 
 Some of these lists are also sources for the Oliphant blocklists:
 => https://writer.oliphant.social/oliphant/the-blocklist-algorithm The Blocklist Algorithm - The Oliphant
@@ -21,7 +21,11 @@ I maintain three blocklists for the Fediverse:
 : A much smaller semi-curated subset of `pleroma.envs.net` suitable for the majority of instances wishing to uphold the Mastodon Covenant's moderation standards. I hope to make it a good starting point for your instance's blocklist, with wiggle room for your own adjustments. Regularly importing `tier0.csv` won't account for retractions; [a sibling blocklist for tier-0 retractions](https://seirdy.one/pb/tier0-retractions.csv) exists for [FediBlockHole](https://github.com/eigenmagic/fediblockhole) users. Note that this list is larger than the bare-minimum I recommend; that'd be FediNuke.
 
 [`FediNuke.txt`](https://seirdy.one/pb/FediNuke.txt)
-: A curated subset of `tier0.csv`, containing what I deem the "worse half" of it. This contains instances I really do recommend most people block, or at least avoid. I try to make it a suitable candidate for a "default blocklist", and use it as reference when I evaluate the quality of other blocklists. This list is not comprehensive; to keep this list small, I excluded many really bad instances. I take into account not just severity, but also notoriety and likelihood of reaching/harming people on other instances (e.g. spewing toxicity in others' mentions, running blockbots, etc). Bad instances that mostly keep to themselves are less likely to cause problems for a new admin, and therefore less likely to get included in this minimal list.
+: A curated subset of `tier0.csv`, containing what I deem the "worse half" of it. This contains instances I really do recommend most people block, or at least avoid. I try to make it a suitable candidate for a "default blocklist", and use it as reference when I evaluate the quality of other blocklists.
+
+This list is not comprehensive; to keep this list small, I excluded many really bad instances. I take into account not just severity, but also notoriety and likelihood of reaching/harming people on other instances (e.g. spewing toxicity in others' mentions, running blockbots, etc). Bad instances that mostly keep to themselves are less likely to cause problems for a new admin, and therefore less likely to get included in this minimal list.
+
+Criteria are not set in stone. Instances well-known for causing significant problems for many other instances, particularly for instances run by and for marginalized groups, may be added.
 
 Some of these lists are also sources for the Oliphant blocklists. {{<mention-work itemtype="BlogPosting">}}{{<indieweb-person name="Oliphant" url="https://oliphant.social/@Oliphant" itemprop="author">}} describes them in his article {{<cited-work name="The Blocklist Algorithm" extraName="headline" url="https://writer.oliphant.social/oliphant/the-blocklist-algorithm">}}.{{</mention-work>}}
 