Category: Search Engines
Link Rot: How to find new sources for busted links
There are a variety of ways to find what you're looking for when you come across a broken link on the interwebs. Here are a few methods i like to use.
Search operators
The first thing you should know is how to use a search engine. Various search engines attach special meaning to certain characters, and these 'search operators', as they're called, can be really helpful. Here are some handy examples that work for Google as well as some other search engines (and no, you shouldn't be using Google directly):
OR : 'OR', or the pipe ( | ) character, tells the search engine you want to search for this OR that. For example, cat|dog will return results containing 'cat' or 'dog', as will cat OR dog.
( ) : Putting words in a group separated by OR or | produces the same result as just described, however you can then add words outside of the group that you always want to see in the results. For example, (red|pink|orange) car will return results that have red, pink or orange cars.
" " : If you wrap a "word" in double quotes, you are telling the search engine that the word is really important. If you wrap multiple words in double quotes, you are telling the search engine to look for pages containing "that exact phrase."
site: : If you want to search only a particular domain, such as 12bytes.org, append site:12bytes.org to your query, or don't include any search terms if you want it to return a list of pages for the domain. You can do the same when performing an image search if you want to see all the images on a domain. You can also search a TLD (Top-Level Domain) using this operator. For example, to search the entire .gov TLD, just append site:.gov to your query.
- : If you prefix a word with a -hyphen, you are telling the search engine to omit results containing that word. You can do the same -"with a phrase" also.
cache: : Prefixing a domain with cache:, such as cache:12bytes.org, will return the most recent cached version of a page.
intitle: : If you prefix a word or phrase with intitle:, you are telling the search engine that the word or phrase must be contained in the titles of the results.
allintitle: : Prefixing words with allintitle: tells the search engine that all words following this operator must be contained in the titles of the search results.
See this page for more examples.
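If you build queries like these often, the gluing can be scripted. Here's a minimal Python sketch that combines the operators described above; the helper and its parameter names are my own invention, not part of any search engine's API:

```python
from urllib.parse import quote_plus

def build_query(terms, phrase=None, site=None, exclude=None):
    """Combine common search operators into a single query string.

    terms   : list of alternatives, OR'd together in a ( ) group
    phrase  : an exact phrase, wrapped in double quotes
    site    : restrict results to this domain via site:
    exclude : a word to omit from results via the - prefix
    """
    parts = []
    if terms:
        parts.append("(" + "|".join(terms) + ")")
    if phrase:
        parts.append('"' + phrase + '"')
    if site:
        parts.append("site:" + site)
    if exclude:
        parts.append("-" + exclude)
    return " ".join(parts)

query = build_query(["red", "pink", "orange"], phrase="classic car", exclude="toy")
print(query)              # (red|pink|orange) "classic car" -toy
print(quote_plus(query))  # percent-encoded form, ready to paste into a search URL
```

The quote_plus() call at the end is only needed if you're constructing a search URL by hand rather than typing into a search box.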
Searching the archives
One of the simplest methods of finding the original target of a busted link is to copy the link location (right-click the link and select 'Copy Link Location') and plug that into one of the web archive services. The two most popular general archives that i'm aware of are the Internet Archive and Archive.is. The Internet Archive provides options to filter your search results for particular types of content, such as web pages, videos, etc. In either case, just paste the copied link in the input field they provide and press your Enter key. If the link is 'dirty', cleaning it up may provide better results. For example, let's say the link is something like:
http://example.com/articles/1995/that-dog-dont-hunt?ref=example.com&partner=sombody&utm_source=google
The archive may not return any results for the URL, but it might if you clean it up by removing everything after 'hunt'.
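If you clean up dirty links often, the chopping is easy to script. Here's a small Python sketch using only the standard library (the function name is mine); it drops the query string and fragment, which is where tracking junk like utm_source lives:

```python
from urllib.parse import urlsplit, urlunsplit

def clean_url(url):
    """Strip the query string and fragment from a URL, leaving the bare path.

    Tracking parameters (utm_source, ref, partner, ...) live in the query
    string, and removing them often helps an archive find a match.
    """
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", ""))

dirty = ("http://example.com/articles/1995/that-dog-dont-hunt"
         "?ref=example.com&partner=sombody&utm_source=google")
print(clean_url(dirty))  # http://example.com/articles/1995/that-dog-dont-hunt
```

Note this removes the entire query string; for the rare page whose address genuinely depends on a query parameter, you'd want to trim parameters selectively instead.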
There are also web browser extensions you can install to make accessing the archive services easier. For Firefox i like the View Page Archive & Cache add-on by 'Armin Sebastian'. When you find a dead link, just right-click it and from the 'View Page Archive' context menu you can select to search all of the enabled archives or just a specific one. Even if the page isn't dead you can right-click in the page and retrieve a cached copy or archive the page yourself. Another cool feature of this add-on is that it will place an icon in the address bar if you land on a dead page and you can just search for an archived version from the icon context menu.
Of these two services, the Internet Archive has a far more extensive library, but there's a very annoying caveat with it that defeats the purpose of an archive, which is why i much prefer Archive.is: the Internet Archive follows robots.txt directives. I won't go into why i think this is stupid; suffice it to say that content stored on the Internet Archive can be removed even if it does not break any of their rules.
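For scripted lookups, the Internet Archive also offers a simple JSON 'availability' endpoint. Here's a rough Python sketch of querying it; the helper names are mine, and the reply shape shown in the comment is what the endpoint returns for archived pages:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://archive.org/wayback/available"

def availability_url(page_url):
    """Build the request URL for the Wayback Machine availability endpoint."""
    return API + "?" + urlencode({"url": page_url})

def parse_availability(data):
    """Pull the closest snapshot URL out of the endpoint's JSON reply, which
    looks like {"archived_snapshots": {"closest": {"url": ..., "timestamp": ...}}}
    (an empty "archived_snapshots" means no snapshot exists)."""
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None

def closest_snapshot(page_url, timeout=10):
    """Return the URL of the closest archived copy, or None (network call)."""
    with urlopen(availability_url(page_url), timeout=timeout) as resp:
        return parse_availability(json.load(resp))

# closest_snapshot("http://12bytes.org/") would return a web.archive.org
# snapshot URL if one exists, or None otherwise.
```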
Dead links and no clues
If all you have is a dead link with no title or description and you can't find a cached copy in one of the archives, you may still be able to find a copy of the document somewhere. For example, let's say the link is https://example.com/pages/my-monkey-stole-my-car.html. The likely title of the document you're looking for is right in the URL -- my-monkey-stole-my-car -- and you can plug that into a search engine just as it is, or remove the hyphens and wrap the title in double quotes to perform a phrase search. Also see some of the other examples here.
Dead links with some clues
If you come across a dead link that has a title or description, but isn't cached in an archive, you can use that to perform a search. Just select the title, or a short but unique phrase from the description (which preferably doesn't contain any punctuation), then wrap it in double quotes and perform a phrase search.
Dead internal website links
If you encounter a website that contains a broken link to another page on the same site and you have some information about the document, like a title or excerpt, you can do a domain search to see if a search engine may link to a working copy. For example, let's assume the title of the page we're looking for is 'Why does my kitten hate me?' on the domain 'example.com'. Copy the title, wrap it in double quotes and plug it into a search engine that supports phrase searches, add a space, then append site:example.com. This will tell the search engine to look for results only on example.com. Also see some of the other examples here.
YouTube videos you know exist but can't find
Because there is a remarkable amount of censorship taking place at YouTube, sensitive videos are sometimes hidden from the results returned by YouTube's own search engine. To get around this, use another search engine to perform a domain search as described in the 'Dead internal website links' section.
Deleted videos
In some cases, such as with a link that points to a removed YouTube video, you may not have any information other than the URL itself, not even a page title. Using the YouTube link https://www.youtube.com/watch?v=abc123xyz as an example, copy youtube.com/watch?v=abc123xyz, wrap it in double quotes and plug that into your preferred search engine. You will often find a forum or blog post somewhere that will provide helpful clues, such as the video title or description, which you can use to search for a working copy of the video. And the first place to look for deleted YouTube videos is YouTube! You can also search the Internet Archive as well as other video platforms that are more censorship resistant than YouTube, including Dailymotion, BitChute, DTube, LEEKWire and many others.
Broken links on your own website
I don't know about you, but i have nearly 4,000 links on 12bytes.org as of this writing and many of them point to resources which TPTB (The Powers That [shouldn't] Be) would rather you knew nothing about. As such, many of the resources i link to are taken down and so i have to deal with broken links constantly, many of them deleted YouTube videos. If you run WordPress (self-hosted - i don't know about a wordpress.com site) you will find Broken Link Checker by 'ManageWP' in the WordPress plugin repository; its job is to constantly scan your site looking for broken links. While it is not a bug-free plugin (the developer is not at all responsive and doesn't seem to fix anything in a timely manner), it is by far the most comprehensive tool of its type that i'm aware of. There are also many external services you could use whether you run WordPress or not.
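If you don't run WordPress, a crude link checker can be cobbled together from Python's standard library alone. This is only a sketch of the idea, not a substitute for a proper tool (a real checker retries, throttles, and follows redirects):

```python
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

class LinkExtractor(HTMLParser):
    """Collect every href found in an <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html_text):
    """Return all anchor hrefs found in a chunk of HTML, in document order."""
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links

def is_dead(url, timeout=10):
    """Very rough liveness check: HEAD the URL and treat any HTTP error or
    network failure as 'dead'."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout):
            return False
    except (HTTPError, URLError, OSError):
        return True

# Example usage (network calls, so commented out):
#   page = urlopen("https://example.com/").read().decode("utf-8", "replace")
#   dead = [u for u in extract_links(page) if u.startswith("http") and is_dead(u)]
```

Some servers reject HEAD requests or block unknown user agents, so expect false positives from a sketch this simple.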
Alternative Search Engines That Respect Your Privacy
It's time to stop relying on corporations which do not respect our privacy. Following are some search engines which are more privacy-centric than those offered by the privacy-hating mega-corporations like Google, Bing and Yahoo.
Unlike meta search engines such as DuckDuckGo, Startpage, etc., which rely either partially or entirely upon third parties for their results (primarily Bing and Google), all search engines listed here maintain their own indexes meaning they actively crawl the web in search of new and updated content to add to their catalogs. A few are hybrids, meaning they rely partially upon a 3rd party engine.
Although meta search engines are often referred to as "alternative" search engines, they are not true alternatives since they are subject to the same censorship/de-ranking practices of the companies upon which they rely. Such search engine companies are really proxies in that they may provide a valuable service by insulating you from privacy-intrusive third party services, however this is not always the case. To gain some insight into the relationships between search engines, see the excellent infographic provided by The Search Engine Map website.
If you are going to use a meta search engine which relies upon a 3rd party, those which depend on Microsoft's Bing seem to return generally better results than those which rely upon Google, especially when searching for sensitive and censored information, though i don't expect this to last since Bing and DuckDuckGo are working together to censor Bing's results.
If you have any indexing search engines you would like to suggest, leave a comment (you need not be logged in). To install search engine plugins for Firefox, see Firefox Search Engine Cautions, Recommendations.
Legend:
- Decentralized: (yes/no) whether or not the service depends upon centralized servers or is distributed among its users, such as YaCy
- Type: (index/hybrid) indexing search engines crawl the web and index content without relying on a 3rd party, whereas hybrid search engines are a combination of both meta and index
- Requires JS / Cookies: (yes/no) whether the website requires JavaScript and/or cookies (web storage) in order to function
- Self Contained: (yes/no) whether or not the website uses 3rd party resources, such as Google fonts, etc.
- Client Required: (yes/no) whether or not you have to install client software in order to use the service
- License: (proprietary/<license type>) whether the source code is available and, if so, the license type
- Privacy Policy: a link to their privacy policy
Brave Search
Search Page | Decentralized | Type | Requires JS / Cookies | Self Contained | Client Required | License | Privacy Policy |
search page | no | hybrid | JS: no 1 Cookies: no 2 | yes | no | proprietary | link |
Brave Search is in the process of building its own index, however until that is complete it also pulls results from 3rd parties, primarily Bing and Google.
The search interface is attractive and intuitive. Unfortunately there are few options for tailoring the search results or the interface, however some of the more important options are in place, including regional and date search options.
Gigablast
Search Page | Decentralized | Type | Requires JS / Cookies | Self Contained | Client Required | License | Privacy Policy |
search page | no | index | JS: yes Cookies: no | yes | no | Apache License 2.0 | ? |
Gigablast is a free and open source search engine that maintains its own index.
The search interface offers some useful options, such as selecting the format of the output, several interesting sorting options, time span options, file type options and plenty of advanced syntax options.
I couldn't find a privacy policy, but decided to include it anyway since it is open source.
You can install and run Gigablast on your own server.
Good Gopher
Search Page | Decentralized | Type | Requires JS / Cookies | Self Contained | Client Required | License | Privacy Policy |
search page | no | index | JS: no Cookies: no | no | no | proprietary | link |
Good Gopher was apparently developed by Mike Adams, editor of the NaturalNews.com website, and appears to be unmaintained.
As stated in the Good Gopher privacy policy, their search results are censored in that they filter out what they and their users consider to be "corporate propaganda and government disinfo", while simultaneously promoting the undisputed heavyweight king of propaganda and disinformation, Alex "Bullhorn" Jones.
The core of their privacy policy consists of a few vague paragraphs, the bulk of which has nothing to do with user privacy.
Revenue is generated by displaying ads in the search results, though they state they are very particular about who may advertise on the platform.
LookSeek
Search Page | Decentralized | Type | Requires JS / Cookies | Self Contained | Client Required | License | Privacy Policy |
search page | no | index | JS: no Cookies: no | no | no | proprietary | link |
LookSeek appears to be owned by Applied Theory Networks LLC and apparently has been around a while. The software seems to be proprietary, but they do have a decent, clear and brief privacy policy.
The search interface is rudimentary, to say the least, and there doesn't appear to be any configuration options.
LookSeek states they have "no political or social bias".
Marginalia Search
Search Page | Decentralized | Type | Requires JS / Cookies | Self Contained | Client Required | License | Privacy Policy |
search page | no | index | JS: no Cookies: no | yes | no | GPL v3+ | link |
Marginalia Search is a very interesting, open source, niche search engine which describes itself as "an independent DIY search engine that focuses on non-commercial content, and attempts to show you sites you perhaps weren't aware of in favor of the sort of sites you probably already knew existed".
One very useful aspect of Marginalia Search is that it allows you to choose the search result ranking algorithm which compiles the search results in different ways, such as by focusing on blogs and personal websites, academic sites, popular sites, etc..
Another potentially unique feature of Marginalia Search is that the results include some information about the website, such as how well the site fits with your search terms, what markup language it is written in and whether it uses JavaScript and/or cookies. Additional information is also provided regarding the content and dependencies for a given site, including whether it employs tracking /analytics, whether it contains media such as video or audio, and whether it contains affiliate links, such as Amazon links.
Mojeek
Search Page | Decentralized | Type | Requires JS / Cookies | Self Contained | Client Required | License | Privacy Policy |
search page | no | index | JS: no Cookies: no 2 | yes | no | proprietary | link |
Mojeek is a UK based company founded in 2004. The company operates its own crawler and promises to return unbiased results. I think Mojeek is currently the most usable and one of the most promising of all the search engines listed here. Mojeek is very open about how they operate, and development of the search engine and its algorithms is driven in part by soliciting input from users.
The search interface is clean and they offer quite a few options to customize how searching works and how the interface looks. Also available are advanced search options and another tool it calls 'Focus' which can direct search terms to specific domains. One can also configure how many search results per domain are returned and, if more than that number are available, Mojeek adds a link under the result which will open a new page with those results when clicked. If you enter a domain as the search term, Mojeek offers the option to search within that domain. The engine also supports some search operators, including site: and since:, the latter of which is similar to the date: operator used by Google.
Mojeek has a simple, clear and solid privacy policy.
Private.sh
Search Page | Decentralized | Type | Requires JS / Cookies | Self Contained | Client Required | License | Privacy Policy |
search page | no | index | JS: yes Cookies: no | yes | no | proprietary | ? |
Private.sh uses the Gigablast engine and is therefore very similar in terms of search results. I felt it was worth having its own entry because it offers additional layers of privacy: your IP address is stripped and searches are encrypted on the client using JavaScript before they are sent to the server, thus even Private.sh apparently doesn't know what you're searching for. As with Gigablast, however, there is no privacy policy.
Right Dao
Search Page | Decentralized | Type | Requires JS / Cookies | Self Contained | Client Required | License | Privacy Policy |
search page | no | index | JS: no Cookies: no | yes | no | proprietary | link |
Right Dao is a U.S. based company.
The search interface is bare and there are no options other than the ability to perform an advanced search. There are only two search scopes: web and news.
Right Dao searches seem to be fairly comprehensive and so this search engine is a solid choice when looking for politically sensitive information that Google and others censor. While the engine accepts phrase searches, that functionality seems to be very broken.
Their privacy policy is reasonably strong.
Wiby
Search Page | Decentralized | Type | Requires JS / Cookies | Self Contained | Client Required | License | Privacy Policy |
search page | no | index | JS: no Cookies: no | yes | no | proprietary | link |
Wiby is an interesting, open-source search engine which is building an index of personal, rather than corporate, websites. The interface is very plain and there is only one option in the settings, however it is designed to work well with older hardware.
YaCy
Search Page | Decentralized | Type | Requires JS / Cookies | Self Contained | Client Required | License | Privacy Policy |
unavailable | yes | index | JS: yes Cookies: no | yes | optional | GPL v2+ | ? |
While YaCy doesn't produce a lot of search results since not enough people use it yet, i think it's one of the more interesting search engines listed here.
YaCy is a decentralized, distributed, censorship resistant search engine and index powered by free, open-source software. For those wanting to run their own instance of YaCy, see their home and GitHub pages. This article from Digital Ocean may also be helpful.
Footnotes
1. While JavaScript is not strictly required, functionality may be reduced if it is disabled.
2. Refusing to accept cookies may result in settings not being saved.
Upcoming search engines
Alexandria
Alexandria is a very new, open-source search engine with its own index, though it's currently built using a 3rd party. The first version of the source code appeared on GitHub in late 2021. The index is very small at the moment and therefore the service isn't really useful yet.
The interface is sparse and there are currently no options for customizing anything, however there are plans to improve the service.
There was no formal privacy policy at the time of writing, however the little information there is indicates a strong regard for privacy. By default they store IP addresses along with search queries in order to improve the service, however they promise to never share this information and there is an option to disable this behavior.
Alexandria is worth keeping an eye on.
I contacted Alexandria in April of 2022 with some questions. Following is our exchange:
Q: what are your values regarding user privacy?
A: We care a lot about user privacy and plan to let users decide how much they want to share. We run Alexandria.org as a non-profit so we have no incentive to store any info other than to make the search results better.
Q: i see that you have a dependency on rsms.me - depending on 3rd parties is always a privacy and security concern and i think it is often unnecessary - it looks like it's only css that's being imported at the moment, but do you plan on adding any other 3rd party dependencies?
A: Yes we use the Inter font which is open source, we just think it is a nice looking font. We generally have a high threshold for using a 3rd party dependencies but I think it is impossible to build everything ourselves so if there are things other people are better at than us and it is not in our core mission to build it we will use third party solutions. For example we depend on Hetzner for servers, we depend on commoncrawl for high volume scraping. But it's quite likely that we remove that dependency when we redesign the website next time.
Q: what are the long-term goals for Alexandria?
A: The long terms goal is to make knowledge as accessible as possible to as many people as possible. We want to give the users of alexandria.org info that are in their best interest without having to think about advertisers or other third parties.
Q: will you offer unbiased results?
A: Our bias should be to show the results that are likely to be the most useful for users, so that is what we are aiming for.
Q: do you respect robots.txt? personally i'm fine with it if you do not since it seems Big Tech is making it difficult for the little guy to compete in this market
A: Our index is primarily built with data from Common Crawl. But when we do crawling our self we respect robots.txt. Our main problem with scraping is not robots.txt, but that many big/valuable sources of information are behind cloudflare and similar services or otherwise closed to scarping.
Q: how do you plan to finance the project?
A: In the long term we hope to be able to finance it with donations.
Q: what is the current size of your index roughly (pages) and at what rate is it growing?
A: Right now we are just using a very small index while rebuilding big parts of the system. The current index is around 100 million urls. Pretty soon we plan to have 10 billion urls indexed.
Q: what search operators will you/do you support (site:, title:, date: etc.)?
A: None right now. The first one we will implement is site: since it is quite simple.
Q: because the code is available, will anyone be able to run Alexandria on their own server and how will that work? will each instance be independent, or might the indexes be shared across all servers?
A: Our index is not open source at the moment. So anyone who want's to create their own search engine will have to create their own index by crawling the web themselves or downloading data from common crawl or similar.
Presearch
Presearch is (currently) yet another meta search engine which is ultimately powered by Big Tech in that it relies on multiple corporate giants for its search results.
Presearch appears to be largely centralized at the moment, though decentralization is a stated goal. In the future Presearch is to be powered largely or entirely by the community in that anyone can run a node and help build an index with content curated by users.
The interface is interesting in that you can select among many different search categories, however it unnecessarily requires JavaScript to be enabled before one can initiate a search and again to display any results.
Presearch uses code from several 3rd parties including bootstrapcdn.com, coinmarketcap.com, cloudfront.net and hcaptcha.com. Such dependencies are often unnecessary, resulting in bloated and potentially insecure platforms which may not be privacy friendly.
Presearch incorporates "PRE" tokens, yet another form of digital currency which is apparently used for a variety of purposes, including incentivizing people to use Presearch, financing the growth of infrastructure and ensuring the integrity of the platform. While people can apparently earn "PRE" when using the search engine, withdrawing their earnings appears to be a convoluted process which is not always successful (see here and here for example).
While Presearch may have potential, the realization of its goals of decentralization and the building of its own index need to be met before it becomes a viable service.
De-listed search engines
DuckDuckGo
DuckDuckGo has openly admitted to censoring and de-ranking search results as well as working with Microsoft's Bing in order to influence their results (DuckDuckGo relies heavily on Bing). In one instance they blacklisted voat.co, a former free speech social platform, and on March 10, 2022, DuckDuckGo's CEO, Gabriel Weinberg, tweeted the following:
Like so many others I am sickened by Russia’s invasion of Ukraine and the gigantic humanitarian crisis it continues to create. #StandWithUkraine️ At DuckDuckGo, we've been rolling out search updates that down-rank sites associated with Russian disinformation.
Weinberg apparently had no problem when the U.S. invaded Iraq, Syria, Libya, etc., nor any problem with Black Lives Matter and Antifa terrorists burning and looting cities throughout the U.S., but he suddenly developed a selective crisis of conscience when Russia invaded Ukraine, which happens to be full of U.S. supported terrorists.
DuckDuckGo also admitted to influencing Microsoft's Bing search results according to a New York Times article:
DuckDuckGo said it "regularly" flagged problematic search terms with Bing so they could be addressed.
DuckDuckGo continues its race to the bottom. From an April 15, 2022, TorrentFreak article:
DuckDuckGo 'Removes' Pirate Sites and YouTube-DL from Its Search Results:
Privacy-centered search engine DuckDuckGo has completely removed the search results for many popular pirates sites including The Pirate Bay, 1337x, and Fmovies. Several YouTube ripping services have disappeared, too and even the homepage of the open-source software youtube-mp3 is unfindable.
On or around 25 May, 2022, it was discovered that DuckDuckGo was allowing tracking by Microsoft:
DuckDuckGo is not safe to browse as Microsoft tracks user data | Technology News – India TV
DuckDuckGo's founder Gabriel Weinberg has admitted to the company's agreement with Microsoft for allowing them to track the user's activity. He further stated that they are taking to Microsoft to change their agreement clause for users' confidentiality.
The trouble with DuckDuckGo began much earlier with its Jewish founder, Gabriel Weinberg:
Why People Should Never Ever Use DuckDuckGo | Techrights
DDG's founder (Gabriel Weinberg) has a history of privacy abuse, starting with his founding of Names DB, a surveillance capitalist service designed to coerce naive users to submit sensitive information about their friends. (2006)
Qwant
Qwant's privacy policy has apparently deteriorated. They collect quite a lot of data, some of which they share with 3rd parties. Most disturbingly is, like DuckDuckGo, they censor results. Someone from Qwant tweeted the following on March 1, 2022:
#UkraineRussiaWar In accordance with the EU sanctions, we have removed the Russian state media RT and Sputnik from our results today. The neutral web should not be used for war propaganda.
For more information see:
- Search Engines - which one to choose? (search for "Qwant")
- Qwant admits censorship
- Privacy policy - About Qwant
Startpage
As of somewhere around 2018 or 2019, Startpage was partially bought out by Privacy One Group/System1 which appears to be a data collection/advertising company. Source: Software Removal | Startpage.com
Other search engines
The Search Engine Party website by Andreas is well worth visiting. He has done an excellent job of compiling a large list of search engines and accompanying data. Also see the 'A look at search engines with their own indexes' page by Rohan Kumar who did an excellent job of compiling a list of engines that maintain their own index, however do note that privacy was not considered.
Reader suggested search engines that didn't make the cut
Cliqz
The Cliqz search engine, which is an index and not a proxy, is largely owned by Hubert Burda Media. The company offers a "free" web browser built on Firefox.
It appears there are two primary privacy policies which apply to the search engine and both are a wall of text. As is often the case, they begin by telling readers how important their privacy is ("Protecting your privacy is part of our DNA") and then spend the next umpteen paragraphs iterating all the allegedly non-personally identifying data they collect and the 3rd party services they use to process it, which in turn have their own privacy policies.
In 2017 the morons at Mozilla corporate made the mistake of partnering with Cliqz and suffered significant backlash when it was discovered that everything users typed in their address bar was being sent to Cliqz. You can read more about this on HN, as well as a reply from Cliqz, also on HN.
Gibiru
I was anxious to try this engine after seeing it listed in NordVPN's article, TOP: Best Private Search Engines in 2019! and so i loaded the website and i liked what they had to say. Unfortunately, Gibiru not only depends on having JavaScript enabled, it depends on having it enabled for Google as well. Fail! It seems Gibiru is little more than a Google front-end and a poor one at that.
Search Encrypt
I added Search Encrypt to the list and later removed it. The website uses cookies and JavaScript by default, their ToS is a wall of corporate gibberish and their privacy policy is weak.
Lastly, Search Encrypt doesn't seem to provide any information about how they obtain their search results, though both the results and interface reek of Google and reading between the lines clearly indicates it is a meta search engine.
Search Encrypt was also recommended by NordVPN who seems happy to promote such garbage.
Yippy
Like Search Encrypt, Yippy, bought by DuckDuckGo, was another ethically challenged company with a poor privacy policy looking to attract investors. Yippy used cookies by default and wouldn't function without JavaScript. Yippy was also recommended by NordVPN.
Evaluating search engines
There are several tests you can perform in order to determine the viability of a search engine. To get a sense of whether the results are biased, i often search for highly controversial subjects such as "holocaust revisionism". If you perform such a search using Google, Bing or DuckDuckGo, with or without quoting it, most or all of the first results link only to mainstream sources which attempt to debunk the subject rather than provide information regarding it. If you perform the same query using Mojeek however, the difference is quite dramatic. Rohan Kumar also offers several great tips for evaluating search engines in his article, A look at search engines with their own indexes:
- "vim", "emacs", "neovim", and "nvimrc": Search engines with relevant results for "nvimrc" typically have a big index. Finding relevant results for the text editors "vim" and "emacs" instead of other topics that share the name is a challenging task.
- "vim cleaner": should return results related to a line of cleaning products rather than the Correct Text Editor.
- "Seirdy": My site is relatively low-traffic, but my nickname is pretty unique and visible on several of the highest-traffic sites out there.
- "Project London": a small movie made with volunteers and FLOSS without much advertising. If links related to the movie show up, the engine’s really good.
- "oppenheimer": a name that could refer to many things. Without context, it should refer to the physicist who worked on the atomic bomb in Los Alamos. Other historical queries: "magna carta" (intermediate), "the prince" (very hard).
Lessons learned from the Findx shutdown
The founder of the Findx search engine, Brian Rasmusson, shut down operations and detailed the reasons for doing so in a post titled, Goodbye – Findx is shutting down. I think the post is of significant interest not only to the end user seeking alternatives to the ethically corrupt mega-giants like Google, Bing, Yahoo, etc., but also to developers who have an interest in creating a privacy-centric, censorship resistant search engine index from scratch. Following are some highlights from the post:
Many large websites like LinkedIn, Yelp, Quora, Github, Facebook and others only allow certain specific crawlers like Google and Bing to include their webpages in a search engine index (maybe something for European Commissioner for Competition Margrethe Vestager to look into?) Other sites put their content behind a paywall. [...] Most advertisers won’t work with you unless you either give them data about your users, so they can effectively target them, or unless you have a lot of users already. Being a new and independent search engine that was doing the time-consuming work of growing its index from scratch, and being unwilling to compromise on our user’s privacy, Findx was unable to attract such partners. [...] We could not retain users because our results were not good enough, and search feed providers that could improve our results refused to work with us before we had a large userbase … the chicken and the egg problem. [...] From forbidding crawlers to index popular and useful websites and refusing to enter into advertising partnerships without large user numbers, to stacking the requirements for search extension behaviour in browsers, the big players actively squash small and independent search providers out of their market.
I think the reasons for the Findx shutdown highlight the need for decentralized, peer-to-peer solutions like YaCy. If we consider the problems Findx faced with the data-harvesting, social engineering giants like Google and Facebook, and the various CDNs like Cloudflare, i think they are the sort of problems that crowdsourced solutions could circumvent. Any website can block whatever search crawler it wants, and there can be good reasons for doing so, but as Brian points out, there are also stupid and unethical reasons. With a decentralized P2P solution, anyone could run a crawler; this could mitigate a lot of these problems and force walled-garden giants such as Facebook to have their content scraped.
Resources
- 5 Best Search Engines That Respect Your Privacy - BestVPN.com
- 12 Private Search Engines that Do Not Track You - Hongkiat
- Alternative Search Engines | Oregon Computer Solutions
- Distributed Search Engines - P2P Foundation
- P2P Search as an Alternative to Google: Recapturing network value through decentralized search » The Journal of Peer Production
- Search Engine Party
- Search Engines - which one to choose?
- The Search Engine Map
Recent changes to this document
17-Mar-2023
- fixed a broken link
Firefox Search Engine Cautions, Recommendations
This tutorial will cover how to sanitize and add search engine plugins for Mozilla Firefox in order to protect your privacy.
See the revision history at the end of this document.
When 'free' software isn't
Have you ever wondered how Mozilla gets paid by privacy-hating mega-monopolies like Google? Simple: when you use the default search engine plugins that are packaged with the browser, parameters similar to these are added to your search query:
client=firefox&appid=ff&hspart=mozilla
These parameters inform the search engine that you're using a Firefox/Mozilla product and that, in part, is how Mozilla is able to rake in millions annually. I would have no problem whatsoever with Mozilla making money were it an ethical company, but it isn't. If you do not wish to support Mozilla for partnering with highly unethical companies like Google or want to punish them for the many other stupid things they've done, read on.
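To make the idea concrete, here is a minimal sketch of how such affiliate/tracking parameters can be stripped from a search URL using only the Python standard library. The parameter names below are the ones mentioned above plus a couple of common examples; the exact set varies by search engine and over time, so treat the list as illustrative rather than exhaustive.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Keys to strip; 'client', 'appid' and 'hspart' are the affiliate-style
# parameters discussed above -- the full set differs per search engine.
TRACKING_KEYS = {"client", "appid", "hspart", "hsimp", "sourceid"}

def strip_tracking(url):
    """Return url with known affiliate/tracking query parameters removed."""
    parts = urlsplit(url)
    clean = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in TRACKING_KEYS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(clean), parts.fragment))
```

This is essentially what add-ons like ClearURLs (mentioned later in this article) do automatically, using much larger, maintained rule lists.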
Types of search engines
The two primary types of search engines are meta search engines and search indexes, and it is important to understand the difference. Google, Yahoo and Bing, for example, use software "robots" called "crawlers" to discover and index web content. In other words, these companies actively seek out fresh and updated content to store in their databases so it's ready for you to find. Meta search engines, on the other hand, do not index the web and instead rely upon third parties such as Google and/or Bing to provide their search results (most use Bing). When you use these so-called "alternative" search engines, such as DuckDuckGo, Startpage, Searx, etc., you are still subject to the filter bubbles and censorship practiced by the corporate giants. That said, privacy-respecting meta search engines may still have value because they offer a way to access the data-harvesting corporate giants without the privacy violations that accessing them directly would incur. Understand though that they are not true alternatives as they are often described, but more like proxies. These "alternative" search engines are also subject to local laws, such as secret surveillance requests issued by a government.
Indexing the web involves storing massive amounts of data and having the bandwidth to deliver search results; it is an incredibly difficult and expensive proposition that requires significant resources and infrastructure. This is why meta search companies like DuckDuckGo, Startpage, Qwant and others rely heavily upon corporations like Alphabet's Google and Microsoft's Bing. There are better alternatives that both respect your privacy and are censorship resistant however. Ever hear of a peer-to-peer distributed search engine? Imagine a free, open-source, decentralized search engine where the web index is created and distributed by ordinary people using personal computers, each storing a piece of the whole. This is what the developers behind YaCy have done with their search engine and i think it's a great way to escape the filter bubbles created by big tech, however YaCy is not yet a viable search engine as of this writing. Mojeek, although it's a centralized search engine, is very focused on privacy, maintains its own index, and is quite usable. For a list of alternative search engines, see Alternative Search Engines That Respect Your Privacy.
Adding search engines to Firefox
To mitigate potential risks to your anonymity posed by the default Firefox search engines, simply disable all of them and use alternatives. One easy way to add a search engine to Firefox is to find one you like and then right-click the address bar and click the "Add..." menu item. Most search engines can be added to Firefox in the same way, but there are additional methods also.
Another easy way to add a custom search engine to Firefox is with the Search Engines Helper add-on by Soufiane Sakhi.
Yet another way to add custom search engines is by using the mozlz4-edit add-on by 'serj_kzv'. This extension allows you to edit the search.json.mozlz4 search engine plugin file directly from within Firefox, though a browser restart is necessary before the changes take effect. This file is located in your Firefox profile directory and it is here that Firefox stores the code for all of its search engine plugins. If you use this tool, be careful not to touch the default search engines in the file, else Firefox will discard all your changes. Instead, create copies of the default engines and edit the copies if you want to use them.
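For the curious, search.json.mozlz4 is ordinary JSON compressed in Mozilla's "mozlz4" container: an 8-byte magic string, a 4-byte little-endian uncompressed size, then a raw LZ4 block. The sketch below parses just the header using the standard library; actually decompressing the payload requires a third-party library (e.g. the 'lz4' package), which is deliberately left out here.

```python
import struct

MAGIC = b"mozLz40\0"  # 8-byte header used by Mozilla's mozlz4 files

def read_mozlz4_header(data):
    """Validate the mozlz4 magic and return (uncompressed_size, payload).

    The payload is a raw LZ4 block; with the third-party 'lz4' package
    installed, something like lz4.block.decompress(payload,
    uncompressed_size=size) should yield the JSON text.
    """
    if data[:8] != MAGIC:
        raise ValueError("not a mozlz4 file")
    (size,) = struct.unpack_from("<I", data, 8)
    return size, data[12:]
```

This is also why you can't open the file in a plain text editor, and why tools like mozlz4-edit exist.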
Manually editing search.json.mozlz4
If you would rather avoid the hassle of manually editing the default Firefox search engine plugins, see the 'Download preconfigured search plugins' section below where you can download my search.json.mozlz4 file.
If you don't want to manually edit the default Firefox search engine plugins, you should at least use something like the ClearURLs add-on, or the ClearURLs for uBo list which requires uBlock Origin, either of which strips the tracking parameters from the search result links. You should also disable JavaScript for all mainstream search engine websites where possible, especially Google and Bing. For this i would again recommend uBlock Origin by Raymond Hill.
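If you use uBlock Origin, per-site JavaScript blocking can be made persistent from its 'My rules' pane using the 'no-scripting' switch. Rules along these lines (the hostnames are just examples) disable scripting on the named sites while leaving the rest of the web untouched:

```text
no-scripting: www.google.com true
no-scripting: www.bing.com true
```

These are dynamic rules, so they survive browser restarts, unlike toggling the per-site JavaScript switch in the popup without saving.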
If you have already added custom search engines to Firefox, create a copy of search.json.mozlz4 and work with the copy, the reason being that if you mess up, Firefox will delete all of your modifications and restore the default search plugins. If you don't want to see or use the default engines, simply disable them in the search preferences of Firefox. And no, as far as i know you cannot remove the default search engine plugins. If you don't know where your Firefox profile is located, load about:profiles in the address bar and you'll figure it out.
To edit the search engines contained in the search.json.mozlz4 file using the mozlz4-edit extension, just click its toolbar icon, then 'Open file', and point it to your search.json.mozlz4 file after you've made a backup copy. I'm not sure it's still possible to sanitize the default search engine plugins which are packaged with Firefox because the URL parameters discussed earlier are no longer contained in the file, but if you want to modify them in any way you must copy them and edit the copies, being sure to give the copies different names since no two search plugins can share the same name.
Download preconfigured search plugins
If you'd rather avoid editing the search engine plugins, you can download a copy of my personal search.json.mozlz4 file that should work for Firefox version 57 and up ("up" meaning until the next time Mozilla decides to break everything again). The download contains the default engines which come with the U.S. English version of Firefox, along with a pile of additional search engines i use. All in all there are around 35 search engine plugins.
Download: search.json.mozlz4.zip
Install: Backup your existing search.json.mozlz4 file(!), then extract the one from the archive to your Firefox profile directory and restart Firefox.
When you use the search engines you'll notice that all the non-default ones are tagged as follows:
- [I] = indexing search engines that actively crawl the web in order to build their own index. These engines are essential for thwarting the censorship practiced by Google and Bing, which is then passed on to all the meta engines that use their results, including DuckDuckGo, Startpage, Qwant, Swisscows, Searx, MetaGer, etc.
- [H] = hybrid search engines which rely upon both 3rd parties (usually Bing) and their own index.
- [M] = meta search engines which rely only upon 3rd parties, usually Bing.
- [S] = special purpose search engines which serve a specific purpose, such as searching for scientific documents.
Any engines which are not tagged are the default search engines, all of which you can/should disable in Firefox's preferences (about:preferences#search).
You'll probably want to rearrange the search plugins from Firefox's preferences so each type is grouped together.
Removing Firefox system add-ons
In addition to search engine plugins, Mozilla also packages system add-ons with Firefox, installs them without your permission, and doesn't provide an easy way to remove or disable all of them. These system add-ons have been used for controversial purposes in the past. To remove them, see the 'System add-ons' section of the Firefox Configuration Guide for Privacy Freaks and Performance Buffs.
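As a first step before following that guide, it can help to simply see which system add-ons are bundled with your install. Here is a small stdlib-only sketch that lists the .xpi files in the features directory; note that the default path below is an assumption for typical Linux installs, and the location differs on Windows, macOS and distribution-specific builds.

```python
from pathlib import Path

# Typical system add-on location on Linux -- an ASSUMPTION; adjust for
# your platform and install (e.g. the Firefox application directory).
DEFAULT_FEATURES_DIR = "/usr/lib/firefox/browser/features"

def list_system_addons(features_dir=DEFAULT_FEATURES_DIR):
    """Return the bundled system add-on (.xpi) files found in features_dir."""
    return sorted(str(p) for p in Path(features_dir).glob("*.xpi"))
```

Review the list (and make backups) before deleting anything; removed system add-ons may be restored by Firefox updates.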
Resources
Special mention goes to 'Thorin-Oakenpants' (aka 'Pants') as well as the 'arkenfox' crew and their GitHub repository where they host an excellent privacy-centric user.js for Firefox and its derivatives, as well as an extensive Wiki full of valuable information.
Resources at 12bytes.org:
External resources:
- mozlz4-edit Firefox add-on by serj_kzv
- Measuring Search in Firefox | Firefox Data
- followonsearch/METRICS.md at master · mozilla/followonsearch · GitHub
- Firefox: How to remove all System Add-ons? | Techdows
- Addressing default search engine privacy · Issue #88 · arkenfox/user.js · GitHub
- list: Search Engines [for Wiki] · Issue #118 · arkenfox/user.js · GitHub
- Creating OpenSearch plugins for Firefox
- Mycroft Project: Search Engine Plugins - Firefox IE Chrome
- The Ultimate Guide to the Google Search Parameters
- 5 Best Search Engines That Respect Your Privacy - BestVPN.com
- Duck Duck Go: Illusion of Privacy
- Neat URL :: Add-ons for Firefox
- User.js file - MozillaZine Knowledge Base
- Whoogle Search
Recent changes
18-Nov-2022
- uploaded a fresh search.json.mozlz4 file
- corrected some links
- minor edits