1
0
Fork 0
mirror of https://git.sr.ht/~seirdy/seirdy.one synced 2024-09-20 04:12:09 +00:00
seirdy.one/content/posts/search-engines-with-own-indexes.gmi
2022-02-08 22:39:09 -08:00

300 lines
24 KiB
Text
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This is a cursory review of all the indexing search engines I have been able to find. Gemini engines are at the bottom; the rest of this post is about Web search engines.
The three dominant English search engines with their own indexes¹ are Google, Bing, and Yandex (GBY). Many alternatives to GBY exist, but almost none of them have their own results; instead, they just source their results from GBY.
With that in mind, I decided to test and catalog all the different indexing search engines I could find. I prioritized breadth over depth, and encourage readers to try the engines out themselves if theyd like more information.
I primarily evaluated English-speaking search engines because thats my primary language. With some difficulty, I could probably evaluate a Spanish one; however, I wasnt able to find many Spanish-language engines powered by their own crawlers.
This page is a “living document” that I plan on updating indefinitely. Check for updates once in a while if you find this page interesting. Feel free to send me suggestions, updates, and corrections; Id especially appreciate help from those who speak languages besides English and can evaluate a non-English indexing search engine. Contact info is in the article footer.
## Rationale
Google, Microsoft (the company behind Bing), and Yandex aren't just search engine companies; they're content and ad companies as well. For example, Google hosts video content on YouTube and Microsoft hosts social media content on LinkedIn. This gives these companies a powerful incentive to prioritize their own content. They are able to do so even if they claim that they treat their own content the same as any other: since they have complete access to their search engines' inner workings, they can tailor their content pages to better fit their algorithms and tailor their algorithms to work well on their own content. They can also index their own content without limitations but throttle indexing for other crawlers.²
One way to avoid this conflict of interest is to *use search engines that aren't linked to major content providers;* i.e., use engines with their own independent indexes.
There's also a practical, non-ideological reason to try other engines: different providers have different results. Websites that are hard to find on one search engine might be easy to find on another, so using more indexes and ranking algorithms results in access to more content.
## Methodology
I focused almost entirely on "organic results" (the classic link results), and didn't focus too much on (often glaring) privacy issues, "enhanced" or "instant" results (e.g. Wikipedia sidebars, related searches, Stack Exchange answers), or other elements.
I compared results for esoteric queries side-by-side; if the first 20 results were (nearly) identical to another engines results (though perhaps in a slightly different order), they were likely sourced externally and not from an independent index.
I tried to pick queries that should have a good number of results and show variance between search engines. An incomplete selection of queries I tested:
* “vim”, “emacs”, “neovim”, and “nvimrc”: Search engines with relevant results for “nvimrc” typically have a big index. Finding relevant results for the text editors “vim” and “emacs” instead of other topics that share the name is a challenging task.
* “vim cleaner”: should return results related to a line of cleaning products rather than the Correct Text Editor.
* “Seirdy”: My site is relatively low-traffic, but my nickname is pretty unique and visible on several of the highest-traffic sites out there.
* “Project London”: a small movie made with volunteers and FLOSS without much advertising. If links related to the movie show up, the engines really good.
* “oppenheimer”: a name that could refer to many things. Without context, it should refer to the physicist who worked on the atomic bomb in Los Alamos. Other historical queries: “magna carta” (intermediate), “the prince” (very hard).
## General indexing search-engines
### Large indexes, good results
These are large engines that pass all the above tests and more.
1. Google: the biggest index. Allows submitting pages and sitemaps for crawling, but requires login. Powers a few other engines:
* Startpage
* (discontinued) Runnaroo
* SAPO (Portuguese interface, can work with English results)
2. Bing: the runner-up. Allows submitting pages and sitemaps for crawling, but requires login. Its index powers many other engines:
* Yahoo
* DuckDuckGo³
* AOL
* Qwant (partial)⁴
* Ecosia
* Ekoru
* Privado
* Findx
* Disconnect Search⁵
* PrivacyWall
* Lilo
* SearchScene
* Peekier
* Oscobo
* Million Short
* Yippy search⁶
* Lycos
* Givero
* Swisscows
* Ask.moe⁷
* Partially powers MetaGer by default; this can be turned off
* At this point, I stopped adding Bing-based search engines. There are just too many.
3. Yandex: originally a Russian search engine, it now has an English version. Some Russian results bleed into its English site. Allows submitting pages and sitemaps for crawling, but requires login. Powers:
* Epic Search (went paid-only by June 2021)
* Occasionally powers DuckDuckGos link results instead of Bing.
4. Mojeek: Claims to be privacy-oriented. Quality isnt at Google/Bing/Yandexs level, but its not bad either. If I had to use Mojeek as my default general search engine, Id live. Partially powers eTools.ch.
5. Petal search: A search engine by Huawei that recently switched from searching for Android apps to general search. Despite its surprisingly good results, I wouldn't recommend it due to privacy concerns. Requires an account to submit sites. I discovered this via my access logs. Be aware that in some jurisdictions, it doesn't use its own index: in Russia and some EU regions it uses Yandex and Qwant, respectively.
=> https://petalsearch.com/ petalsearch.com
### Smaller indexes, relevant results
These engines pass most of the tests listed in the “methodology” section.
* Right Dao: very fast, good results. Passes the tests fairly well. It plans on including query-based ads if/when its userbase grows.⁸
* Gigablast: Its been around for a while and also sports a classic web directory. Searches are a bit slow, and it charges to submit sites for crawling. It powers Private.sh. Gigablast is tied with Right Dao for quality.
* Gowiki: Very young, small index, but shows promise. I discovered this in the seirdy.one access logs. Currently only available in the US. Seems down as of early 2022.
=> https://rightdao.com Right Dao
=> https://gigablast.com/ Gigablast
=> https://private.sh Private.sh
=> https://gowiki.com Gowiki
### Smaller indexes, hit-and-miss
These engines fail badly at a few important tests. Otherwise, they seem to work well enough.
* seekport: The interface is in German but it supports searching in English just fine. The default language is selected by your locale. Its really good considering its small index; it hasnt heard of less common terms (e.g. “Seirdy”), but its able to find relevant results in other tests. The server does not support TLS.
* Exalead: slow, quality is hit-and-miss. Its indexer claims to crawl the DMOZ directory, which has since shut down and been replaced by the Curlie directory. No relevant results for “Oppenheimer” and some other history-related queries. Allows submitting individual URLs for indexing, but requires solving a Google reCAPTCHA and entering an email address.
* ExactSeek: small index, disproportionately dominated by big sites. Failed multiple tests. Allows submitting individual URLs for crawling, but requires entering an email address and receiving a newsletter. Webmaster tools seem to heavily push for paid SEO options. It also powers SitesOnDisplay and Blog-search.com.
* sengine.info: only shows domains, not individual pages. Developed by netEstate GmbH, which specializes in content extraction for inprints and job ads. Also has a German-only version available.
=> http://www.seekport.com/ seekport (HTTP only)
=> https://www.exalead.com/search/ Exalead
=> https://curlie.org Curlie
=> https://www.exactseek.com/ ExactSeek
=> https://www.sengine.info/ sengine.info
* Meorca: A UK-based search engine that claims not to "index pornography or illegal content websites". It also features a public blog with a marketplace and free games. Allows submitting URLs, but requires a full name, email, phone number, and "business name" to do so. Discovered in the seirdy.one access logs.
* Infotiger: Allows choosing between multiple different sorting algorithms (date, text length, PageRank, and "AND"). Supports English and German. The legacy version allows submitting links via a web-form or email, but I don't know if that has an impact on the more recent alpha version.
* search.tl: Generalist search for one TLD at a time (defaults to .com). I'm not sure why you'd want to always limit your searches to a single TLD, but now you can.⁹ There isn't any visible UI for changing the TLD for available results; you need to add/change the "tld" URL paramater. For example, to search .org sites, append "&tld=org" to the URL. It seems to be connected to Amidalla.de, but Amidalla doesn't seem to currently be operational. Amidalla allows users to manually add URLs to its index and directory; I have yet to see if doing so impacts search.tl results.
* Kozmonavt: Has a small index of almost 5 million sites. If I want to find the website for a certain project, Kozmonavt works well (provided its index has crawled said website). It works poorly for learning things and finding general information. I cannot recommend it for anything serious since it lacks contact information, a privacy policy, or any other information about the org/people who made it. Discovered in the seirdy.one access logs.
* Burf.co: Very small index, but seems fine at ranking more relevant results higher. Allows site submission without any extra steps.
=> https://meorca.com/ Meorca Search Engine
=> https://alpha.infotiger.com/ Infotiger
=> http://www.search.tl search.tl
=> https://kozmonavt.ml/ Kozmonavt
=> https://burf.co/ Burf.co
* ChatNoir: An experimental engine by researchers that uses the Common Crawl index. The engine is open source. There's more information in its announcement on the Common Crawl mailing list (Google Groups).
=> https://www.chatnoir.eu/ ChatNoir
=> https://commoncrawl.org/ Common Crawl
=> https://github.com/chatnoir-eu ChatNoir source code (GitHub)
=> https://groups.google.com/g/common-crawl/c/3o2dOHpeRxo/m/H2Osqz9dAAAJ ChatNoir Announcement
### Unusable engines, irrelevant results
Results from these search engines dont seem at all useful.
* YaCy: community-made index; slow. Results are awful/irrelevant, but can be useful for intranet or custom search.
* Scopia: only seems to be available via the MetaGer metasearch engine after turning off Bing and news results. Tiny index, very low-quality.
* Artado Search: Primarily Turkish, but it also seems to support English results. Like Plumb, it uses client-side JS to fetch results from existing engines (Google, Bing, Yahoo, Petal, and others); like MetaGer, it has an option to use its own independent index. Results from its index are almost always empty. Very simple queries ("twitter", "wikipedia", "reddit") give some answers. Supports site submission and crowdsourced instant answers.
* Active Search Results: very poor quality
* Crawlson: young, slow. In this category because its index has a cap of 10 URLs per domain. I initially discovered Crawlson in the seirdy.one access logs.
* Anoox: Results are few and irrelevant; fails to find any results for basic terms. Allows site submission. It's also a lightweight social network and claims to be powered by its users, letting members vote on listings to alter rankings.
* Yioop!: A FLOSS search engine that boasts a very impressive feature-set: it can parse sitemaps, feeds, and a variety of markup formats; it can import pre-curated data in forms such as access logs, Usenet posts, and WARC archives; it also supports feed-based news search. Despite the impressive feature set, Yioop's results are few and irrelevant due to its small index. It allows submitting sites for crawling. Like Meorca, Yioop has social features such as blogs, wikis, and a chat bot API.
=> https://metager.org MetaGer
=> https://www.artadosearch.com/ Artado Search
=> https://www.activesearchresults.com Active Search Results
=> https://crawlson.com Crawlson
=> https://www.anoox.com/ Anoox
=> https://archive.is/oVAre Plumb CPO
=> https://www.yioop.com Yioop!
### Semi-independent indexes
Engines in this category fall back to GBY when their own indexes don't have enough results. As their own indexes grow, they claim that this should happen less often.
* Brave Search: Many tests (including all the tests I listed in the "Methodology" section) resulted results identical to Google, revealed by a side-by-side comparison with Google, Startpage, and a Searx instance with only Google enabled. Brave claims that this is due to how Cliqz (the discontinued engine acquired by Brave) used query logs to build its page models and was optimized to match Google.¹⁰ The index is independent, but optimizing against Google resulted in too much similarity for the real benefit of an independent index to show.
=> https://search.brave.com/ Brave Search
* Plumb: Almost all queries return no results; when this happens, it falls back to Google. It's fairly transparent about the fallback process, but I'm concerned about *how* it does this: it loads Google's Custom Search scripts from "cse.google.com" onto the page to do a client-side Google search. This can be mitigated by using a browser addon to block "cse.google.com" from loading any scripts. Plumb claims that this is a temporary measure while its index grows, and they're planning on getting rid of this. Allows submitting URLs, but requires solving an hCaptcha. This engine is very new; hopefully as it improves, it could graduate from this section. Its Chief Product Officer previously founded the Gibiru search engine which shares the same affiliates and (for now) the same index; the indexes will diverge with time.
=> https://plumb.one/ Plumb
* Neeva: Combines Bing results with results from its own index. Bing normally isn't okay with this, but Neeva managed to get an agreement with Bing that allowed result-mixing. As of right now, results are mostly identical to Bing but original links not found by Bing frequently pop up. Long and esoteric queries are less likely to feature original results. Requires signing up with an email address to use; after the (rather long) free trial it requires payment as an alternative to ads.
=> https://neeva.com/ Neeva
* Qwant: Qwant claims to use its own index, but it still relies on Bing for most results. It seems to be in a position similar to Neeva. Try a side-by-side comparison to see if or how it compares with Bing.
=> https://www.qwant.com Qwant
## Non-generalist search
These indexing search engines dont have a Google-like “ask me anything” endgame; theyre trying to do something different. You aren't supposed to use these engines the same way you use GBY.
* Wiby: I love this one. It focuses on smaller independent sites that capture the spirit of the “early” web. Its more focused on “discovering” new interesting pages that match a set of keywords than finding a specific resources. I like to think of Wiby as an engine for surfing, not searching. Runnaroo occasionally features a hit from Wiby. If you have a small site or blog that isnt very “commercial”, consider submitting it to the index.
* search.marginalia.nu: A recent addition similar to Wiby. It has its own crawler but is strongly biased towards non-commercial, personal, and/or minimal sites. It's a great response to the increasingly SEO-spam-filled SERPs of GBY.
* Search My Site: Similar to Wiby, but only indexes user-submitted personal and independent sites. It optionally supports IndieAuth.
* Quor: seems to mainly index large news sites. Site is down as of June 2021. Originally available at www dot quor dot com.
* Ninfex: a "people-powered" search engine that combines aspects of link aggregators and search. It lets users vote on submissions and it also displays links to forums about submissions.
=> https://wiby.me wiby.me
=> https://search.marginalia.nu/ search.marginalia.nu
=> https://wiby.org wiby.org
=> https://searchmysite.net Search My site
=> https://ninfex.com Ninfex
## Other languages
Im unable to evaluate these engines properly since I dont speak the necessary languages. English searches on these are a hit-or-miss. I might have made a few mistakes in this category.
### Big indexes
* Baidu: Chinese. Very large index; it's a major engine alongside GBY. Offers webmaster tools for site submission.
* Qihoo 360: Chinese. Im not sure how independent this one is.
* Sogou: Chinese
* Yisou: Chinese
* Naver: Korean. Allows submitting sitemaps and feeds. Discovered via some Searx metasearch instances.
* Seznam: Czech, seems relatively privacy-friendly. Discovered in the seirdy.one access logs. It allows site submission with webmaster tools.
* Cốc Cốc: Vietnamese
* go.mail.ru: Russian
=> https://search.naver.com Naver
=> https://www.seznam.cz/ Seznam
=> https://coccoc.com/search Cốc Cốc
=> https://go.mail.ru/ go.mail.ru
### Smaller indexes
* Parsijoo: Persian
* search.ch: Regional search engine for Switzerland; users can restrict searches to their local regions.
* fastbot: German
* Moose.at: German (Austria-based)
=> https://www.parsijoo.ir/ Parsijoo
=> https://search.ch search.ch
=> https://www.fastbot.de/ fastbot
=> https://www.moose.at Moose.at
## Misc
* Ask.com: the main site shut down, but subdomains like uk.ask.com are still alive. They claim to outsource search results. The results seem similar to Google, Bing, and Yandex; however, I cant pinpoint exactly where their results are coming from.
* Not evaluated: Apples search. Its only accessible through a search widget in iOS and macOS and shows very few results. This might change; see the next section.
* Not evaluated: Kagi Search. It's in a closed beta and I haven't yet gotten an invitation.
* Partially evaluated: Infinity Search. It has a young, small index. It recently split into a paid offering with the main index and Infinity Decentralized, the latter of which allows users to select from community-hosted crawlers. I managed to try it out before it became a paid offering, and it seemed decent; however, I wasnt able to run the tests listed in the “Methodology” section. Allows submitting URLs and sitemaps into a text box, no other work required.
=> https://uk.ask.com uk.ask.com
=> https://infinitysearch.co Infinity Search
=> https://infinitydecentralized.com/ Infinity Decentralized
## Upcoming engines
These engines arent ready yet; their indexes are either in a proof-of-concept phase with a handful of sites or arent available yet.
* Apple: given the activity of the AppleBot crawler lately, their index will almost certainly grow to a size large enough to power a general search engine soon. Check your servers access logs; theres a good chance its crawled your site if you have a few backlinks.
## Gemini search engines
Time for my first Gemini-exclusive content! A Gemini page about search engines wouldn't be complete without a few search engines for the Gemini space.
* gus.guru: the original Gemini search engine. The index doesn't seem to be updated anymore.
* geminispace.info: Another instance of GUS, but with an updated index.
=> gemini://gus.guru/ gus.guru
=> gemini://geminispace.info/ geminispace.info
## Graveyard
These engines were originally included in the article, but have since been discontinued.
* wbsrch: In addition to its generalist search, it also had many other utilities related to domain name statistics. Failed multiple tests. Its index was a bit dated; it had an old backlog of sites it hadnt finished indexing. It also had several dedicated per-language indexes.
=> https://xangis.com/the-wbsrch-experiment/ The Wbsrch Experiment
## Acknowledgements
Some of this content came from the Search Engine Map and Search Engine Party. A few web directories also proved useful.
=> https://www.searchenginemap.com/ Search Engine Map
=> https://searchengine.party/ Search Engine Party
Matt from Gigablast also gave me some helpful information on GBY which I included in the "Rationale" section. He's written more about big tech in the Gigablast blog:
=> https://gigablast.com/blog.html Gigablast blog
Nicholas A. Ferrell of The New Leaf Journal wrote a great post on alternative search engines.
=> https://thenewleafjournal.com/a-2021-list-of-alternative-search-engines-and-search-resources/ A 2021 List of Alternative Search Engines and Search Resources
=> gemini://gemlog.blue/users/naferrell/ N.A. Ferrell's Gemlog
He also gave me some useful details about Seznam, Naver, Baidu, and Goo:
=> https://lists.sr.ht/~seirdy/seirdy.one-comments/%3C20210618031450.rb2twu4ypek6vvl3%40rkumarlappie.attlocal.net%3E Re: Editor of The New Leaf Journal - Added Your Guestbook Comment Info to My Post + Feedback
## Notes
¹ Yes, “indexes” is an acceptable plural form of the word “index”. The word “indices” sounds weird to me outside a math class.
² Matt from Gigablast told me that indexing YouTube or LinkedIn will get you blocked if you aren't Google or Microsoft. I imagine that you could do so by getting special permission if you're a megacorporation.
³ DuckDuckGo has a crawler called DuckDuckBot. This crawler doesn't impact the linked results displayed; it just grabs favicons and scrapes data for a few instant answers. DuckDuckGo's help pages claim that the engine uses over 400 sources; my interpretation is that at least 398 sources don't impact organic results. I don't think DuckDuckGo is transparent enough about the fact that their organic results are proxied. Compare DuckDuckGo side-by-side with Bing and Yandex and you'll see it's sourcing organic results from one of them (probably Bing).
⁴ Qwant claims to also use its own crawler for results, but its still mostly Bing in my experience. See the "semi-independent" section.
⁵ Disconnect Search allows users to have results proxied from Bing or Yahoo, but Yahoo sources its results from Bing.
⁶ Yippy claims to be powered by a certain IBM brand (a brand that could correspond to any number of products) and annotates results with the phrase “Yippy Index”, but a side-by-side comparison with Bing and other Bing-based engines revealed results to be nearly identical.
⁷ Ask.moe was working on a FLOSS indexer; its search page stated an intention to switch to it from Bing at one point. This statement has since been removed.
=> https://git.sr.ht/~danskeren/spider.moe FLOSS indexer
⁸ This is based on a statement Right Dao made in on Reddit:
=> https://reddit.com/comments/k4clx1/_/ge9dwmh/?context=1 Right Dao on Reddit
=> https://web.archive.org/web/20210320042457/https://i.reddit.com/r/degoogle/comments/k4clx1/right_dao_a_new_independent_search_engine_that/ge9dwmh/?context=1 Archive of the Reddit thread
⁹ Some search engines support the "site:" search operator to limit searches to subpages/subdomains of a single site or TLD. "site:.one", for instance, limits searches to websites with the ".one" TLD.
¹⁰ More information can be found in a HN subthread and the Cliqz tech blog:
=> https://news.ycombinator.com/item?id=27593801 HN comment thread for "Introducing Brave Search Beta"
=> https://0x65.dev/blog/2019-12-06/building-a-search-engine-from-scratch.html Tech @ Cliqz: Building a search engine from scratch
=> https://0x65.dev/blog/2019-12-10/search-quality-at-cliqz.html Tech @ Cliqz: Search quality at Cliqz