mirror of https://git.sr.ht/~seirdy/seirdy.one synced 2024-12-26 02:22:09 +00:00
seirdy.one/static/robots.txt
Seirdy 0139e58f87
new page: scrapers i block
Also remove the comments in robots.txt in favor of directing people to
that page.
2024-12-10 00:37:54 -05:00


# Please see <https://seirdy.one/meta/scrapers-i-block/> for an explanation on almost every entry, including intentionally-excluded entries.
User-agent: *
Disallow: /noindex/
Disallow: /misc/
User-Agent: peer39_crawler/1.0
Disallow: /
## IP-violation scanners ##
User-Agent: TurnitinBot
Disallow: /
User-Agent: SlySearch
Disallow: /
User-Agent: BLEXBot
Disallow: /
User-agent: CheckMarkNetwork/1.0 (+https://www.checkmarknetwork.com/spider.html)
Disallow: /
User-agent: BrandVerity/1.0
Disallow: /
## Misc. icky stuff ##
User-agent: PiplBot
Disallow: /
# Well-known overly-aggressive bot that claims to respect robots.txt: http://mj12bot.com/
User-agent: MJ12bot
Crawl-Delay: 10
## Gen-AI data scrapers ##
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-Agent: FacebookBot
User-Agent: meta-externalagent
Disallow: /
User-agent: Cotoyogi
Disallow: /
User-agent: Webzio-extended
Disallow: /
User-agent: Kangaroo Bot
Disallow: /
User-Agent: GenAI
Disallow: /
User-Agent: SemrushBot-OCOB
User-Agent: SemrushBot-FT
Disallow: /
User-Agent: VelenPublicWebCrawler
Disallow: /
Sitemap: https://seirdy.one/sitemap.xml
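
One way to sanity-check rules like the ones above is to feed them to Python's standard-library `urllib.robotparser` and ask what a given user agent may fetch. The sketch below is illustrative only: `RULES_EXCERPT` is a hand-copied subset of this file, not the full list, and the variable names are assumptions. Python's matching logic is just one interpretation of robots.txt, so a real crawler may behave differently.

```python
from urllib.robotparser import RobotFileParser

# Hand-copied excerpt of the rules above (assumption: a subset is enough
# to illustrate the grouping behaviour; this is not the complete file).
RULES_EXCERPT = """
User-agent: *
Disallow: /noindex/
Disallow: /misc/
User-agent: MJ12bot
Crawl-Delay: 10
User-agent: GPTBot
Disallow: /
User-Agent: FacebookBot
User-Agent: meta-externalagent
Disallow: /
Sitemap: https://seirdy.one/sitemap.xml
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network request is made here.
parser.parse(RULES_EXCERPT.strip().splitlines())

# Agents without their own group fall back to the "*" group.
generic_home = parser.can_fetch("ExampleBrowser", "https://seirdy.one/")
generic_noindex = parser.can_fetch("ExampleBrowser", "https://seirdy.one/noindex/page")

# Agents with a dedicated "Disallow: /" group are blocked everywhere.
gptbot_home = parser.can_fetch("GPTBot", "https://seirdy.one/")

# Consecutive User-Agent lines share the rules that follow them.
meta_home = parser.can_fetch("meta-externalagent", "https://seirdy.one/")

# MJ12bot is only rate-limited, not blocked.
mj12_delay = parser.crawl_delay("MJ12bot")
mj12_home = parser.can_fetch("MJ12bot", "https://seirdy.one/")

print(generic_home, generic_noindex, gptbot_home, meta_home, mj12_delay, mj12_home)
print(parser.site_maps())
```

`ExampleBrowser` is a made-up user-agent string standing in for any client that has no dedicated group. `site_maps()` requires Python 3.8 or later.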