User-agent: *
Disallow: /noindex/
Disallow: /misc/

# I opt out of online advertising so malware that injects ads on my site won't
# get paid. You should do the same. My ads.txt file contains a standard
# placeholder to forbid any compliant ad networks from paying for ad placement
# on my domain.
User-Agent: Adsbot
Disallow: /
Allow: /ads.txt
Allow: /app-ads.txt

# Enabling our crawler to access your site offers several significant benefits
# to you as a publisher. By allowing us access, you enable the maximum number
# of advertisers to confidently purchase advertising space on your pages. Our
# comprehensive data insights help advertisers understand the suitability and
# context of your content, ensuring that their ads align with your audience's
# interests and needs. This alignment leads to improved user experiences,
# increased engagement, and ultimately, higher revenue potential for your
# publication. (https://www.peer39.com/crawler-notice)
# --> fuck off.
User-agent: peer39_crawler
User-Agent: peer39_crawler/1.0
Disallow: /

## IP-violation scanners ##

# The next three are borrowed from https://www.videolan.org/robots.txt

# > This robot collects content from the Internet for the sole purpose of
# helping educational institutions prevent plagiarism. [...] we compare student
# papers against the content we find on the Internet to see if we can find
# similarities. (http://www.turnitin.com/robot/crawlerinfo.html)
# --> fuck off.
User-Agent: TurnitinBot
Disallow: /

# > NameProtect engages in crawling activity in search of a wide range of brand
# and other intellectual property violations that may be of interest to our
# clients. (http://www.nameprotect.com/botinfo.html)
# --> fuck off.
User-Agent: NPBot
Disallow: /

# iThenticate is a new service we have developed to combat the piracy of
# intellectual property and ensure the originality of written work for
# publishers, non-profit agencies, corporations, and newspapers.
# (http://www.slysearch.com/)
# --> fuck off.
User-Agent: SlySearch
Disallow: /

# BLEXBot assists internet marketers to get information on the link structure
# of sites and their interlinking on the web, to avoid any technical and
# possible legal issues and improve overall online experience.
# (http://webmeup-crawler.com/)
# --> fuck off.
User-Agent: BLEXBot
Disallow: /

# Providing Intellectual Property professionals with superior brand protection
# services by artfully merging the latest technology with expert analysis.
# (https://www.checkmarknetwork.com/spider.html/)
# "The Internet is just way to big to effectively police alone." (ACTUAL quote)
# --> fuck off.
User-agent: CheckMarkNetwork/1.0 (+https://www.checkmarknetwork.com/spider.html)
Disallow: /

# Stop trademark violations and affiliate non-compliance in paid search.
# Automatically monitor your partner and affiliates’ online marketing to
# protect yourself from harmful brand violations and regulatory risks. We
# regularly crawl websites on behalf of our clients to ensure content
# compliance with brand and regulatory guidelines.
# (https://www.brandverity.com/why-is-brandverity-visiting-me)
# --> fuck off.
User-agent: BrandVerity/1.0
Disallow: /

## Misc. icky stuff ##

# Pipl assembles online identity information from multiple independent sources
# to create the most complete picture of a digital identity and connect it to
# real people and their offline identity records. When all the fragments of
# online identity data are collected, connected, and corroborated, the result
# is a more trustworthy identity.
# --> fuck off.
User-agent: PiplBot
Disallow: /

## Gen-AI data scrapers ##

# Eat shit, OpenAI.
User-agent: ChatGPT-User
User-agent: GPTBot
Disallow: /

# Official way to opt out of Google's generative AI training:
# <https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers>
User-agent: Google-Extended
Disallow: /

# There isn't any public documentation for this AFAICT.
# Reuters thinks this works so I might as well give it a shot.
User-agent: anthropic-ai
User-agent: Claude-Web
Disallow: /
# Extremely aggressive crawling with no documentation. People had to email the
# company about this for robots.txt guidance.
User-agent: ClaudeBot
Disallow: /

# FacebookBot crawls public web pages to improve language models for our speech
# recognition technology.
# <https://developers.facebook.com/docs/sharing/bot/?_fb_noscript=1>
User-Agent: FacebookBot
Disallow: /

# I'm not blocking CCBot for now. It publishes a free index for anyone to use.
# Google used this to train the initial version of Bard (now called Gemini).
# I allow CCBot since its index is also used for upstart/hobbyist search engines
# like Alexandria and for genuinely useful academic work I personally like.
# I allow Owler for similar reasons:
# <https://openwebsearch.eu/owler/#owler-opt-out>
# <https://openwebsearch.eu/common-goals-with-common-crawl/>.
# Omgilibot/Omgili is similar to CCBot, except it sells the scrape results.
# I'm not familiar enough with Omgili to make a call here.
# In the long run, my embedded robots meta-tags and headers could cover gen-AI

# I don't block cohere-ai or Perplexitybot: they don't appear to actually
# scrape data for LLM training purposes. The crawling powers search engines
# with integrated pre-trained LLMs.
# TODO: investigate whether YouBot scrapes to train its own in-house LLM.

Sitemap: https://seirdy.one/sitemap.xml