From 4bc0d54ecb64d6545b92144604ad3cbad2ecb251 Mon Sep 17 00:00:00 2001 From: Rohan Kumar Date: Sat, 26 Feb 2022 17:15:14 -0800 Subject: [PATCH] Expand and re-locate "methodology" section --- .../posts/search-engines-with-own-indexes.gmi | 53 +++++++++++++----- .../posts/search-engines-with-own-indexes.md | 56 ++++++++++++++----- 2 files changed, 82 insertions(+), 27 deletions(-) diff --git a/content/posts/search-engines-with-own-indexes.gmi b/content/posts/search-engines-with-own-indexes.gmi index 3f8753a..a185b90 100644 --- a/content/posts/search-engines-with-own-indexes.gmi +++ b/content/posts/search-engines-with-own-indexes.gmi @@ -4,8 +4,6 @@ The three dominant English search engines with their own indexes¹ are Google, B With that in mind, I decided to test and catalog all the different indexing search engines I could find. I prioritized breadth over depth, and encourage readers to try the engines out themselves if they’d like more information. -I primarily evaluated English-speaking search engines because that’s my primary language. With some difficulty, I could probably evaluate a Spanish one; however, I wasn’t able to find many Spanish-language engines powered by their own crawlers. - This page is a “living document” that I plan on updating indefinitely. Check for updates once in a while if you find this page interesting. Feel free to send me suggestions, updates, and corrections; I’d especially appreciate help from those who speak languages besides English and can evaluate a non-English indexing search engine. Contact info is in the article footer. I plan on updating the engines in the top two categories with more info comparing the structured/linked data the engines leverage (RDFa vocabularies, microdata, microformats, JSON-LD, etc.) to help authors determine which formats to use. @@ -18,19 +16,11 @@ One way to avoid this conflict of interest is to *use search engines that aren't There's also a practical, non-ideological reason to try other engines: different providers have different results. Websites that are hard to find on one search engine might be easy to find on another, so using more indexes and ranking algorithms results in access to more content. -## Methodology +## About the list -I focused almost entirely on "organic results" (the classic link results), and didn't focus too much on (often glaring) privacy issues, "enhanced" or "instant" results (e.g. Wikipedia sidebars, related searches, Stack Exchange answers), or other elements. +I primarily evaluated English-speaking search engines because that’s my primary language. With some difficulty, I could probably evaluate a Spanish one; however, I wasn’t able to find many Spanish-language engines powered by their own crawlers. -I compared results for esoteric queries side-by-side; if the first 20 results were (nearly) identical to another engine’s results (though perhaps in a slightly different order), they were likely sourced externally and not from an independent index. - -I tried to pick queries that should have a good number of results and show variance between search engines. An incomplete selection of queries I tested: - -* “vim”, “emacs”, “neovim”, and “nvimrc”: Search engines with relevant results for “nvimrc” typically have a big index. Finding relevant results for the text editors “vim” and “emacs” instead of other topics that share the name is a challenging task. -* “vim cleaner”: should return results related to a line of cleaning products rather than the Correct Text Editor. 
-* “Seirdy”: My site is relatively low-traffic, but my nickname is pretty unique and visible on several of the highest-traffic sites out there.
-* “Project London”: a small movie made with volunteers and FLOSS without much advertising. If links related to the movie show up, the engine’s really good.
-* “oppenheimer”: a name that could refer to many things. Without context, it should refer to the physicist who worked on the atomic bomb in Los Alamos. Other historical queries: “magna carta” (intermediate), “the prince” (very hard).
+See the "Methodology" section at the bottom for how I evaluated each one.
 
 ## General indexing search-engines
 
@@ -297,6 +287,43 @@ Two engines were excluded from this list for having a far-right focus.
 
 One engine was excluded because it seems to be built using cryptocurrency in a way I'd rather not support.
 
+## Methodology
+
+### Discovery
+
+I find new engines by:
+
+* Monitoring certain web directories for changes in their search engine listings.
+* Checking other curated lists of "good/bad bots" to spot search engines.
+* Using search engines to discover search engines: searching for the names of less-popular engines often pulls up similar lists.
+* Receiving suggestions from readers.
+* Compiling a list of regular expressions for user-agent strings I'm familiar with. Before I delete my server access logs, I extract user-agents that don't match that list along with the pages they request.
+* Checking the Searx and SearXNG projects for new integrations.
+
+### Evaluation
+
+I focused almost entirely on "organic results" (the classic link results), and paid less attention to (often glaring) privacy issues, "enhanced" or "instant" results (e.g. Wikipedia sidebars, related searches, Stack Exchange answers), or other elements.
+
+I compared results for esoteric queries side-by-side; if the first 20 results were (nearly) identical to another engine’s results (though perhaps in a slightly different order), they were likely sourced externally and not from an independent index.
+
+I tried to pick queries that should have a good number of results and show variance between search engines. An incomplete selection of queries I tested:
+
+* “vim”, “emacs”, “neovim”, and “nvimrc”: Search engines with relevant results for “nvimrc” typically have a big index. Finding relevant results for the text editors “vim” and “emacs” instead of other topics that share the name is a challenging task.
+* “vim cleaner”: should return results related to a line of cleaning products rather than the Correct Text Editor.
+* “Seirdy”: My site is relatively low-traffic, but my nickname is pretty unique and visible on several of the highest-traffic sites out there.
+* “Project London”: a small movie made with volunteers and FLOSS without much advertising. If links related to the movie show up, the engine’s really good.
+* “oppenheimer”: a name that could refer to many things. Without context, it should refer to the physicist who worked on the atomic bomb in Los Alamos. Other historical queries: “magna carta” (intermediate), “the prince” (very hard).
+
+Some less-mainstream engines have noticed this article, which is great! I've had excellent discussions with people who work on several of these engines. Unfortunately, this article's visibility also incentivizes some engines to optimize specifically for any methodology I describe. I've addressed this by keeping a long list of test queries to myself. The simple queries above are a decent starting point for quick evaluations, but I also test for common search operators, keyword length, and types of domain-specific jargon, as well as queries designed to pull up specific pages with varying levels of popularity and recency to gauge the size, scope, and growth of an index.
+
+Professional critics often work anonymously because personalization can damage the integrity of their reviews. For similar reasons, I try each engine anonymously at least once by using a VPN and/or my standard anonymous setup: an amnesiac Whonix VM with the Tor Browser. I also often test using a fresh profile when traveling, or via a Searx instance if it supports a given engine. When avoiding personalization, I use "varied" queries that I don't repeat verbatim across search engines; this reduces the likelihood of those searches being linked to me. I also attempt to spread these tests out over time so admins won't notice an unusual uptick in unpredictable and esoteric searches.
+
+### Caveats
+
+I didn't try to avoid personalization when testing engines that require account creation. Entries in the "hit-and-miss" and "unusable" sections got less attention: for instance, I didn't spend a lot of effort tracking results over time to see how new entries got added to them.
+
+I avoided "natural language" queries like questions, focusing instead on keyword searches and search operators. I also mostly ignored infoboxes (also known as "instant answers").
+
 ## Acknowledgements
 
 Some of this content came from the Search Engine Map and Search Engine Party. A few web directories also proved useful.
diff --git a/content/posts/search-engines-with-own-indexes.md b/content/posts/search-engines-with-own-indexes.md
index 0cd6e34..c46bcf1 100644
--- a/content/posts/search-engines-with-own-indexes.md
+++ b/content/posts/search-engines-with-own-indexes.md
@@ -18,8 +18,6 @@ The three dominant English search engines with their own indexes[^1] are Google, Bing, and Yandex.
 With that in mind, I decided to test and catalog all the different indexing search engines I could find. I prioritized breadth over depth, and encourage readers to try the engines out themselves if they'd like more information.
 
-I primarily evaluated English-speaking search engines because that's my primary language. With some difficulty, I could probably evaluate a Spanish one; however, I wasn't able to find many Spanish-language engines powered by their own crawlers.
-
 This page is a "living document" that I plan on updating indefinitely. Check for updates once in a while if you find this page interesting. Feel free to send me suggestions, updates, and corrections; I'd especially appreciate help from those who speak languages besides English and can evaluate a non-English indexing search engine. Contact info is in the article footer.
 
 I plan on updating the engines in the top two categories with more info comparing the structured/linked data the engines leverage (RDFa vocabularies, microdata, microformats, JSON-LD, etc.) to help authors determine which formats to use.
 
@@ -33,20 +31,12 @@ One way to avoid this conflict of interest is to _use search engines that aren't
 There's also a practical, non-ideological reason to try other engines: different providers have different results. Websites that are hard to find on one search engine might be easy to find on another, so using more indexes and ranking algorithms results in access to more content.
 
-Methodology
------------
+About the list
+--------------
 
-I focused almost entirely on "organic results" (the classic link results), and didn't focus too much on (often glaring) privacy issues, "enhanced" or "instant" results (e.g. Wikipedia sidebars, related searches, Stack Exchange answers), or other elements.
+I primarily evaluated English-speaking search engines because that's my primary language. With some difficulty, I could probably evaluate a Spanish one; however, I wasn't able to find many Spanish-language engines powered by their own crawlers.
 
-I compared results for esoteric queries side-by-side; if the first 20 results were (nearly) identical to another engine's results (though perhaps in a slightly different order), they were likely sourced externally and not from an independent index.
-
-I tried to pick queries that should have a good number of results and show variance between search engines. An incomplete selection of queries I tested:
-
-- "vim", "emacs", "neovim", and "nvimrc": Search engines with relevant results for "nvimrc" typically have a big index. Finding relevant results for the text editors "vim" and "emacs" instead of other topics that share the name is a challenging task.
-- "vim cleaner": should return results related to a [line of cleaning products](https://en.wikipedia.org/wiki/Vim_%28cleaning_product%29) rather than the Correct Text Editor.
-- "Seirdy": My site is relatively low-traffic, but my nickname is pretty unique and visible on several of the highest-traffic sites out there.
-- "Project London": a small movie made with volunteers and FLOSS without much advertising. If links related to the movie show up, the engine's really good.
-- "oppenheimer": a name that could refer to many things. Without context, it should refer to the physicist who worked on the atomic bomb in Los Alamos. Other historical queries: "magna carta" (intermediate), "the prince" (very hard).
+See the "Methodology" section at the bottom for how I evaluated each one.
 
 General indexing search-engines
 -------------------------------
@@ -211,6 +201,44 @@ Two engines were excluded from this list for having a far-right focus.
 
 One engine was excluded because it seems to be built using cryptocurrency in a way I'd rather not support.
 
+Methodology
+-----------
+
+### Discovery
+
+I find new engines by:
+
+- Monitoring certain web directories for changes in their search engine listings.
+- Checking other curated lists of "good/bad bots" to spot search engines.
+- Using search engines to discover search engines: searching for the names of less-popular engines often pulls up similar lists.
+- Receiving suggestions from readers.
+- Compiling a list of regular expressions for user-agent strings I'm familiar with. Before I delete my server access logs, I extract user-agents that don't match that list along with the pages they request (see the sketch after this list).
+- Checking the Searx and SearXNG projects for new integrations.
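+
+To make the log-filtering bullet above concrete, here's a minimal sketch of the idea. It isn't my actual script: the Combined Log Format pattern and the `known_agents.txt` file of user-agent regular expressions (one per line) are assumptions for illustration.
+
+```python
+# unknown_agents.py: print user-agents that match no known-crawler
+# pattern, along with the pages they requested.
+import re
+import sys
+
+# Matches requests in Combined Log Format, e.g.:
+# 203.0.113.7 - - [26/Feb/2022:17:15:14 -0800] "GET /about/ HTTP/1.1" 200 512 "-" "FooBot/1.0"
+LOG_LINE = re.compile(
+    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" \d+ \S+ "[^"]*" "(?P<agent>[^"]*)"'
+)
+
+def main(log_path: str, patterns_path: str) -> None:
+    with open(patterns_path) as f:
+        known = [re.compile(p.strip()) for p in f if p.strip()]
+    with open(log_path) as f:
+        for line in f:
+            match = LOG_LINE.search(line)
+            if match and not any(p.search(match["agent"]) for p in known):
+                print(match["agent"], match["path"], sep="\t")
+
+if __name__ == "__main__":
+    main(sys.argv[1], sys.argv[2])
+```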
+
+### Evaluation
+
+I focused almost entirely on "organic results" (the classic link results), and paid less attention to (often glaring) privacy issues, "enhanced" or "instant" results (e.g. Wikipedia sidebars, related searches, Stack Exchange answers), or other elements.
+
+I compared results for esoteric queries side-by-side; if the first 20 results were (nearly) identical to another engine's results (though perhaps in a slightly different order), they were likely sourced externally and not from an independent index.
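+
+As a rough illustration of this overlap check, here's a minimal sketch; `fetch_results` is a hypothetical helper standing in for however you collect an engine's ordered organic links:
+
+```python
+def overlap(results_a: list[str], results_b: list[str], n: int = 20) -> float:
+    """Jaccard overlap of two engines' top-n result URLs, ignoring order."""
+    top_a, top_b = set(results_a[:n]), set(results_b[:n])
+    return len(top_a & top_b) / max(len(top_a | top_b), 1)
+
+# Overlap near 1.0 across many esoteric queries suggests results sourced
+# from a shared or licensed index rather than an independent crawl:
+# overlap(fetch_results("engine-a", "nvimrc"), fetch_results("engine-b", "nvimrc"))
+```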
+
+I tried to pick queries that should have a good number of results and show variance between search engines. An incomplete selection of queries I tested:
+
+- "vim", "emacs", "neovim", and "nvimrc": Search engines with relevant results for "nvimrc" typically have a big index. Finding relevant results for the text editors "vim" and "emacs" instead of other topics that share the name is a challenging task.
+- "vim cleaner": should return results related to a [line of cleaning products](https://en.wikipedia.org/wiki/Vim_%28cleaning_product%29) rather than the Correct Text Editor.
+- "Seirdy": My site is relatively low-traffic, but my nickname is pretty unique and visible on several of the highest-traffic sites out there.
+- "Project London": a small movie made with volunteers and FLOSS without much advertising. If links related to the movie show up, the engine's really good.
+- "oppenheimer": a name that could refer to many things. Without context, it should refer to the physicist who worked on the atomic bomb in Los Alamos. Other historical queries: "magna carta" (intermediate), "the prince" (very hard).
+
+Some less-mainstream engines have noticed this article, which is great! I've had excellent discussions with people who work on several of these engines. Unfortunately, this article's visibility also incentivizes some engines to optimize specifically for any methodology I describe. I've addressed this by keeping a long list of test queries to myself. The simple queries above are a decent starting point for quick evaluations, but I also test for common search operators, keyword length, and types of domain-specific jargon, as well as queries designed to pull up specific pages with varying levels of popularity and recency to gauge the size, scope, and growth of an index.
+
+Professional critics often work anonymously because personalization can damage the integrity of their reviews. For similar reasons, I try each engine anonymously at least once by using a VPN and/or my standard anonymous setup: an amnesiac Whonix VM with the Tor Browser. I also often test using a fresh profile when traveling, or via a Searx instance if it supports a given engine. When avoiding personalization, I use "varied" queries that I don't repeat verbatim across search engines; this reduces the likelihood of those searches being linked to me. I also attempt to spread these tests out over time so admins won't notice an unusual uptick in unpredictable and esoteric searches.
+
+### Caveats
+
+I didn't try to avoid personalization when testing engines that require account creation. Entries in the "hit-and-miss" and "unusable" sections got less attention: for instance, I didn't spend a lot of effort tracking results over time to see how new entries got added to them.
+
+I avoided "natural language" queries like questions, focusing instead on keyword searches and search operators. I also mostly ignored infoboxes (also known as "instant answers").
+
 Acknowledgements
 ----------------