mirror of https://git.sr.ht/~seirdy/seirdy.one synced 2024-09-20 04:12:09 +00:00
seirdy.one/content/posts/search-engines-with-own-indexes.gmi
Rohan Kumar 83e6689b83
New article: search engines with their own indexes
Squashed commit of the following:

commit f04ef91062
Author: Rohan Kumar <seirdy@seirdy.one>
Date:   Wed Mar 10 13:47:35 2021 -0800

    Final draft

commit db25b73460
Author: Rohan Kumar <seirdy@seirdy.one>
Date:   Wed Mar 10 13:37:35 2021 -0800

    Add gemtext version

commit de8c4f1f28
Author: Rohan Kumar <seirdy@seirdy.one>
Date:   Wed Mar 10 12:52:20 2021 -0800

    New article (draft): search engines
2021-03-10 13:48:19 -08:00


This is a cursory review of all the indexing search engines I have been able to find.
The three dominant English search engines with their own indexes¹ are Google, Bing, and Yandex (GBY). Many alternatives to GBY exist, but almost none of them have their own results; instead, they just source their results from GBY.
With that in mind, I decided to test and catalog all the different indexing search engines I could find. I prioritized breadth over depth, and encourage readers to try the engines out themselves if they’d like more information.
I primarily evaluated English-speaking search engines because that’s my primary language. With some difficulty, I could probably evaluate a Spanish one; however, I wasn’t able to find any Spanish-language engines powered by their own crawlers.
This page is a “living document” that I plan on updating indefinitely. Check for updates once in a while if you find this page interesting. Feel free to send me suggestions, updates, and corrections; I’d especially appreciate help from those who speak languages besides English and can evaluate a non-English indexing search engine. Contact info is in the article footer.
## Methodology
I mainly evaluated link results, and didn’t focus too much on (often glaring) privacy issues, “enhanced” or “instant” results (e.g. Wikipedia sidebars, related searches, StackExchange answers), or other elements.
I compared results for esoteric queries side-by-side; if the first 20 results were (nearly) identical to another engine’s results (though perhaps in a slightly different order), they were likely sourced externally and not from an independent index.
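The side-by-side comparison can be roughly sketched in code. This is only an illustration, not the procedure I actually ran (I compared result pages by eye): the function names, the URL normalization, and the 0.9 threshold standing in for “nearly identical” are all assumptions, and collecting each engine’s result URLs is left to the reader.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to host + path so trivial differences
    (scheme, "www." prefix, trailing slash) don't hide a match."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def overlap(results_a, results_b, depth=20):
    """Fraction of engine A's first `depth` links that also appear
    among engine B's first `depth` links, ignoring order."""
    top_a = {normalize(u) for u in results_a[:depth]}
    top_b = {normalize(u) for u in results_b[:depth]}
    if not top_a:
        return 0.0
    return len(top_a & top_b) / len(top_a)


def likely_same_index(results_a, results_b, depth=20, threshold=0.9):
    """Heuristic only: the threshold is an assumed cutoff for
    "nearly identical", not a value taken from the article."""
    return overlap(results_a, results_b, depth) >= threshold
```

A high overlap on a single query proves little, since popular queries converge on the same pages everywhere; the signal comes from repeating the check across several esoteric queries.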
I tried to pick queries that should have a good number of results and show variance between search engines. An incomplete selection of queries I tested:
* “vim”, “emacs”, “neovim”, and “nvimrc”: Search engines with relevant results for “nvimrc” typically have a big index. Finding relevant results for the text editors “vim” and “emacs” instead of other topics that share the name is a challenging task.
* “vim cleaner”: should return results related to a line of cleaning products rather than the Correct Text Editor.
* “Seirdy”: My site is relatively low-traffic, but my nickname is pretty unique and visible on several of the highest-traffic sites out there.
* “Project London”: a small movie made with volunteers and FLOSS without much advertising. If links related to the movie show up, the engine’s really good.
* “oppenheimer”: a name that could refer to many things. Without context, it should refer to the physicist who worked on the atomic bomb in Los Alamos. Other historical queries: “magna carta” (intermediate), “the prince” (very hard).
## General indexing search-engines
### Large indexes, good results
These are large engines that pass all the above tests and more.
1. Google: the biggest index. Allows submitting pages and sitemaps for crawling, but requires login. Powers a few other engines:
* Startpage
* Runaroo
* SAPO (Portuguese interface, can work with English results)
2. Bing: the runner-up. Allows submitting pages and sitemaps for crawling, but requires login. Its index powers many other engines:
* Yahoo
* DuckDuckGo²
* AOL
* Qwant³
* Ecosia
* Ekoru
* Privado
* Findx
* Disconnect Search⁴
* PrivacyWall
* Lilo
* SearchScene
* Peekier
* Oscobo
* Million Short
* Yippy search⁵
* Lycos
* Givero
* Swisscows
* Ask.moe⁶
* Partially powers MetaGer by default; this can be turned off
* At this point, I stopped adding Bing-based search engines. There are just too many.
3. Yandex: originally a Russian search engine, it now has an English version. Some Russian results bleed into its English site. Allows submitting pages and sitemaps for crawling, but requires login. Powers:
* Epic Search
* Occasionally powers DuckDuckGo’s link results instead of Bing.
4. Mojeek: Claims to be privacy-oriented. Quality isn’t at Google/Bing/Yandex’s level, but it’s not bad either. If I had to use Mojeek as my default general search engine, I’d live.
5. Petal search: gopetal.com and petalsearch.com. A very new engine developed by Huawei. Surprisingly good results; it passed all the listed tests. Requires an account to submit sites. I discovered this via my access logs.
=> https://www.gopetal.com/ gopetal.com
=> https://petalsearch.com/ petalsearch.com
### Smaller indexes, relevant results
These engines pass most of the tests listed in the “methodology” section.
* Right Dao: very fast, good results. Passes the tests fairly well.
* Gigablast: It’s been around for a while and also sports a classic web directory. Searches are a bit slow, and it charges to submit sites for crawling. It powers Private.sh. Gigablast is tied with Right Dao for quality.
* Gowiki: Very young, small index, but shows promise. I discovered this in the seirdy.one access logs.
=> https://rightdao.com Right Dao
=> http://gigablast.com/ Gigablast
=> https://private.sh Private.sh
=> https://gowiki.com Gowiki
### Smaller indexes, hit-and-miss
These engines fail badly at a few important tests.
* seekport: The interface is in German but it supports searching in English just fine. The default language is selected by your locale. It’s really good considering its small index; it hasn’t heard of less common terms (e.g. “Seirdy”), but it’s able to find relevant results in other tests.
* Exalead: slow, quality is hit-and-miss. Its indexer claims to crawl the DMOZ directory, which has since shut down and been replaced by the Curlie directory. No relevant results for “Oppenheimer” and some other history-related queries. Allows submitting individual URLs for indexing, but requires solving a Google reCAPTCHA and entering an email address.
* wbsrch: In addition to its generalist search, it also has many other utilities related to domain name statistics. Failed multiple tests. Its index is a bit dated; it has an old backlog of sites it hasn’t finished indexing.
* ExactSeek: small index, disproportionately dominated by big sites. Failed multiple tests. Allows submitting individual URLs for crawling, but requires entering an email address and receiving a newsletter. Webmaster tools seem to heavily push for paid SEO options.
=> http://www.seekport.com/ seekport
=> http://www.exalead.com/search/ Exalead
=> https://curlie.org Curlie
=> https://wbsrch.com/ wbsrch
=> https://www.exactseek.com/ ExactSeek
### Unusable engines, irrelevant results
Results from these search engines don’t seem at all useful.
* YaCy: community-made index; slow. Results are awful/irrelevant, but can be useful for intranet or custom search.
* Scopia: only seems to be available via the MetaGer metasearch engine after turning off Bing and news results. Tiny index, very low-quality.
* Active Search Results: very poor quality.
* Crawlson: young, slow. In this category because its index has a cap of 10 URLs per domain. I initially discovered Crawlson in the seirdy.one access logs. The site seems to be down right now, so I didn’t link it.
=> https://metager.org MetaGer
=> https://www.activesearchresults.com Active Search Results
## Non-generalist search
These indexing search engines don’t have a Google-like “ask me anything” endgame; they’re trying to do something different.
* Wiby: I love this one. It focuses on smaller independent sites that capture the spirit of the “early” web. It’s more focused on “discovering” new interesting pages that match a set of keywords than finding a specific resource. I like to think of Wiby as an engine for surfing, not searching. Runaroo occasionally features a hit from Wiby. If you have a small site or blog that isn’t very “commercial”, consider submitting it to the index.
* Quor: seems to mainly index large news sites.
=> https://wiby.me wiby.me
=> https://wiby.org wiby.org
=> https://www.quor.com Quor
## Other languages
I’m unable to evaluate these engines properly since I don’t speak the necessary languages. English searches on these are hit-or-miss. I might have made a few mistakes in this category.
### Big indexes
* Baidu: Chinese
* Qihoo 360: Chinese. I’m not sure how independent this one is.
* Sogou: Chinese
* Yisou: Chinese
* Naver: Korean.
* Seznam: Czech, seems relatively privacy-friendly. Discovered in the seirdy.one access logs.
* Cốc Cốc: Vietnamese
=> https://search.naver.com Naver
=> https://www.seznam.cz/ Seznam
=> https://coccoc.com/search Cốc Cốc
### Smaller indexes
* Parsijoo: Persian
* search.ch: Regional search engine for Switzerland; users can restrict searches to their local regions.
* fastbot: German
* Moose.at: German (Austria-based)
=> http://www.parsijoo.ir/ Parsijoo
=> https://search.ch search.ch
=> https://www.fastbot.de/ fastbot
=> https://www.moose.at Moose.at
## Misc
* Ask.com: the main site shut down, but subdomains like uk.ask.com are still alive. They claim to outsource search results. The results seem similar to Google, Bing, and Yandex; however, I can’t pinpoint exactly where their results are coming from.
* Not evaluated: Apple’s search. It’s only accessible through a search widget in iOS and macOS and shows very few results. This might change; see the next section.
* Partially evaluated: Infinity Search. Young, small index. It recently split into a paid offering with the main index and Infinity Decentralized, the latter of which allows users to select from community-hosted crawlers. I managed to try it out before it became a paid offering, and it seemed decent; however, I wasn’t able to run the tests listed in the “Methodology” section. Allows submitting URLs and sitemaps into a text box, no other work required.
=> https://uk.ask.com uk.ask.com
=> https://infinitysearch.co Infinity Search
=> https://infinitydecentralized.com/ Infinity Decentralized
## Upcoming engines
These engines aren’t ready yet; their indexes are either in a proof-of-concept phase with a handful of sites, or aren’t available yet.
* Apple: given the activity of the AppleBot crawler lately, their index will almost certainly grow to a size large enough to power a general search engine soon. Check your server’s access logs; there’s a good chance it’s crawled your site if you have a few backlinks.
* Brave announced that it will start its own independent search engine based on the now-discontinued Cliqz. I don’t endorse the company, but I hope its results will be available through metasearch engines like Searx.
=> https://brave.com/search/ Brave Search
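Checking access logs for crawler activity, as suggested for AppleBot above, can be done with a few lines of scripting. The following is a minimal sketch under several assumptions: the log stores the User-Agent string on each line (as the common “combined” log format does), the log path is passed on the command line, and the crawler tokens in the list are ones I believe these crawlers advertise rather than names taken from this article. Adjust the list after looking at your own logs.

```python
import sys
from collections import Counter

# Assumed User-Agent substrings; verify against your own logs.
CRAWLER_TOKENS = ("Applebot", "PetalBot", "SeznamBot", "MojeekBot")


def count_crawler_hits(log_lines, tokens=CRAWLER_TOKENS):
    """Count log lines whose text mentions one of the crawler tokens.

    Matching is a case-insensitive substring test; each line is
    counted at most once, for the first token that matches.
    """
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                hits[token] += 1
                break
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python crawlers.py /path/to/access.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for name, count in count_crawler_hits(log).most_common():
            print(f"{name}: {count}")
```

User-Agent strings are trivially spoofed, so a match only suggests that a crawler visited; it is a starting point for discovering new engines, not proof.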
## Acknowledgements
Some of this content came from the Search Engine Map and Search Engine Party. A few web directories also proved useful.
=> https://www.searchenginemap.com/ Search Engine Map
=> https://searchengine.party/ Search Engine Party
## Notes
¹ Yes, “indexes” is an acceptable plural form of the word “index”. The word “indices” sounds weird to me outside a math class.
² DuckDuckGo has a crawler called DuckDuckBot. This crawler doesn’t impact the linked results displayed; it just grabs favicons and scrapes data for a few instant answers.
³ Qwant claims to also use its own crawler for results, but it’s still mostly Bing. Try a side-by-side comparison; I found that it doesn’t seem to have anything besides Bing results.
⁴ Disconnect Search allows users to have results proxied from Bing or Yahoo, but Yahoo sources its results from Bing.
⁵ Yippy claims to be powered by a certain IBM brand (a brand that could correspond to any number of products) and annotates results with the phrase “Yippy Index”, but a side-by-side comparison with Bing and other Bing-based engines revealed results to be nearly identical.
⁶ Ask.moe was working on a FLOSS indexer; its search page stated an intention to switch to it from Bing at one point. This statement has since been removed.
=> https://git.sr.ht/~danskeren/spider.moe FLOSS indexer