
How Do Search Engines Work?

July 16, 2020

Search engines are an integral part of our daily lives.

Most of us are familiar with Google. How do I bake a cake? Where does my favorite actor live? Who wrote this book? What are the latest trends in fashion? Questions like these are answered by our friendly 'Google'.
Google is one of many search engines available today that 'dig' around the Internet and present us with the most relevant and valuable information.

Let us now understand how these search engines work.

Essentially, all search engines go through three stages:

  • Crawling
  • Indexing
  • Ranking and Retrieval

Crawling
This stage involves scanning websites and obtaining information about everything contained there: the page title, keywords, layout and the pages it links to, at a bare minimum.

This task is performed by special software robots, called “spiders” or “crawlers”.

These robots usually start with the most heavily used servers and the most popular web pages. The link structure is crucial in determining the route the crawlers follow: newly found links are followed to reach further interconnected documents, and previously visited sites are revisited to check for changes. It is a never-ending process.

Sometimes the crawlers give up if the actual content is buried too many clicks away from the homepage.
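
To make the idea concrete, here is a minimal sketch of such a crawler in Python, using only the standard library. The seed URL, the page limit and the regular-expression link extraction are simplifications for illustration; a real crawler respects robots.txt, adds politeness delays, and parses HTML properly.

    import re
    import urllib.request
    from collections import deque

    def crawl(seed_url, max_pages=10):
        """Breadth-first crawl starting from a seed URL (simplified sketch)."""
        queue = deque([seed_url])   # frontier of URLs still to visit
        visited = set()             # URLs already fetched
        pages = {}                  # url -> raw HTML collected so far

        while queue and len(pages) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue            # skip unreachable pages
            pages[url] = html
            # Naive link extraction; a real crawler uses an HTML parser.
            for link in re.findall(r'href="(https?://[^"]+)"', html):
                if link not in visited:
                    queue.append(link)
        return pages

    # Example: pages = crawl("https://example.com")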

Indexing
Once all the data has been gathered, selected pieces of it are stored in huge storage facilities. An analogy: suppose we own a number of books. Going through all of them is the crawling; making a list of them, along with their authors and other related information, is the indexing.

This example provides a small-scale view.

If we expand this analogy to the books contained in all the libraries in the world, that gives a sense of the magnitude of what a search engine undertakes.
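
In code terms, the 'list of books' a search engine builds is usually an inverted index: a mapping from each word to the documents that contain it. Here is a minimal sketch, with the tokenizer and the sample documents invented purely for illustration:

    from collections import defaultdict

    def build_index(documents):
        """Map each word to the set of document IDs that contain it."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for word in text.lower().split():
                index[word].add(doc_id)
        return index

    docs = {
        1: "how to bake a cake",
        2: "the cake recipe book",
    }
    index = build_index(docs)
    print(index["cake"])   # {1, 2}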

Ranking and Retrieval
Search engines are answer machines. Whenever we perform an online search, the search engine scours its database for the most relevant results and ranks them based on the popularity of the websites. Relevance and popularity are the most important factors a search engine considers in order to deliver satisfactory results.

Ranking algorithms differ from one search engine to another. An engine might assign a weight to each entry according to where the word appears: in the title, the meta tags or the sub-headings.

The most basic algorithm uses the frequency of the keyword being searched. This, however, led to 'keyword stuffing', where pages are filled mostly with nonsense, as long as the keyword appears often enough.
This gave way to link-based ranking: the more popular a site, the more often other sites link to it.
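
The best-known link-based measure is Google's PageRank. Below is a sketch of the iterative computation on a tiny invented link graph; the damping factor of 0.85 matches the original paper, but the graph and the simple handling of dangling pages are illustrative assumptions.

    def pagerank(links, damping=0.85, iterations=20):
        """Iteratively compute PageRank for a dict of page -> list of outlinks."""
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1 - damping) / len(pages) for p in pages}
            for page, outlinks in links.items():
                if not outlinks:
                    continue        # dangling pages ignored in this sketch
                share = rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += damping * share
            rank = new_rank
        return rank

    # Tiny invented graph: A and B both link to C, so C ends up ranked highest.
    graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
    print(pagerank(graph))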

Meta Tags

Meta tags allow the owner of a page to specify the keywords and concepts under which the page will be indexed. This can be helpful, especially when the words on the page have double or triple meanings: the meta tags can guide the search engine in choosing which of the several possible meanings is correct. There is, however, a danger in over-reliance on meta tags, because a careless or unscrupulous page owner might add meta tags that fit very popular topics but have nothing to do with the actual contents of the page. To protect against this, spiders correlate meta tags with page content, rejecting the meta tags that don't match the words on the page.
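
As a sketch of that correlation check, the snippet below extracts meta keywords with Python's standard html.parser and keeps only those that actually occur in the page text. The acceptance rule here is a deliberately crude stand-in for whatever heuristics real engines use:

    from html.parser import HTMLParser

    class MetaChecker(HTMLParser):
        """Collect meta keywords and visible text from an HTML page."""
        def __init__(self):
            super().__init__()
            self.keywords = []
            self.text = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and attrs.get("name", "").lower() == "keywords":
                self.keywords = [k.strip().lower()
                                 for k in attrs.get("content", "").split(",")]

        def handle_data(self, data):
            self.text.append(data.lower())

    def trusted_keywords(html):
        """Keep only the meta keywords that also occur in the page text."""
        checker = MetaChecker()
        checker.feed(html)
        body = " ".join(checker.text)
        return [k for k in checker.keywords if k and k in body]

    page = ('<html><head><meta name="keywords" content="cake, celebrity">'
            '</head><body>How to bake a cake</body></html>')
    print(trusted_keywords(page))   # ['cake'] -- 'celebrity' is rejected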

All of this assumes that the owner of a page actually wants it included in a search engine's results. Often, the page's owner doesn't want it showing up on a major search engine, or doesn't want a spider accessing the page at all. Consider, for example, a game that builds new, active pages each time sections of the page are displayed or new links are followed. If a Web spider accesses one of these pages and begins following all of the links for new pages, the game could mistake the activity for a high-speed human player and spin out of control. To avoid situations like this, the robots exclusion protocol was developed. This protocol, implemented either in a site's robots.txt file or in a robots meta tag at the top of a Web page, tells a spider to leave the page alone: neither index the words on the page nor try to follow its links.
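
In practice the protocol is most often served as a robots.txt file at the root of a site, and Python ships a parser for it. A short sketch (the URL and user-agent string are placeholders):

    from urllib import robotparser

    # Fetch and parse the site's robots.txt (the URL is a placeholder).
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # A polite spider asks permission before crawling each page.
    if rp.can_fetch("MyCrawler/1.0", "https://example.com/private/game"):
        print("allowed to crawl")
    else:
        print("this page asked to be left alone")

    # The per-page equivalent is a meta tag in the page's head section:
    # <meta name="robots" content="noindex, nofollow">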

Building the Index

Once the spiders have completed the task of finding information on Web pages (and we should note that this is a task that is never actually completed — the constantly changing nature of the Web means that the spiders are always crawling), the search engine must store the information in a way that makes it useful. There are two key components involved in making the gathered data accessible to users:

  • The information stored with the data
  • The method by which the information is indexed

In the simplest case, a search engine could just store the word and the URL where it was found. In reality, this would make for an engine of limited use, since there would be no way of telling whether the word was used in an important or a trivial way on the page, whether the word was used once or many times or whether the page contained links to other pages containing the word. In other words, there would be no way of building the ranking list that tries to present the most useful pages at the top of the list of search results.

To make for more useful results, most search engines store more than just the word and URL. An engine might store the number of times that the word appears on a page. The engine might assign a weight to each entry, with increasing values assigned to words as they appear near the top of the document, in sub-headings, in links, in the meta tags or in the title of the page. Each commercial search engine has a different formula for assigning weight to the words in its index. This is one of the reasons that a search for the same word on different search engines will produce different lists, with the pages presented in different orders.
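
As an illustration of storing more than just the word and URL, the sketch below accumulates a weighted score per word and page, where a title hit counts more than a body hit. The specific weights and field names are invented for illustration, not any engine's real formula:

    from collections import defaultdict

    # Invented weighting: where a word appears determines how much it counts.
    WEIGHTS = {"title": 10.0, "heading": 5.0, "link": 3.0, "body": 1.0}

    def index_page(index, url, fields):
        """Accumulate a weighted score per (word, url) pair."""
        for field, text in fields.items():
            for word in text.lower().split():
                index[word][url] += WEIGHTS[field]

    index = defaultdict(lambda: defaultdict(float))
    index_page(index, "https://example.com/cake",
               {"title": "cake recipes", "body": "how to bake a cake"})
    print(index["cake"])   # the title hit outweighs the body hit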

Regardless of the precise combination of additional pieces of information stored by a search engine, the data will be encoded to save storage space. For example, the original Google paper describes using 2 bytes, of 8 bits each, to store information on weighting — whether the word was capitalized, its font size, position, and other information to help in ranking the hit. Each factor might take up 2 or 3 bits within the 2-byte grouping (8 bits = 1 byte). As a result, a great deal of information can be stored in a very compact form. After the information is compacted, it’s ready for indexing.
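
The packing described above can be sketched with ordinary bit operations. The field layout below (1 bit for capitalization, 3 bits for font size, 12 bits for word position) is an assumption chosen to fit 16 bits, not Google's actual encoding:

    def pack(capitalized, font_size, position):
        """Pack three ranking signals into one 16-bit (2-byte) integer.

        Assumed layout: 1 bit capitalization, 3 bits font size (0-7),
        12 bits word position (0-4095).
        """
        assert 0 <= font_size < 8 and 0 <= position < 4096
        return (int(capitalized) << 15) | (font_size << 12) | position

    def unpack(value):
        """Recover the three fields from the packed value."""
        return bool(value >> 15), (value >> 12) & 0b111, value & 0xFFF

    packed = pack(capitalized=True, font_size=3, position=42)
    print(packed.to_bytes(2, "big"))   # exactly 2 bytes of storage
    print(unpack(packed))              # (True, 3, 42)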

An index has a single purpose: It allows information to be found as quickly as possible. There are quite a few ways for an index to be built, but one of the most effective ways is to build a hash table. In hashing, a formula is applied to attach a numerical value to each word. The formula is designed to evenly distribute the entries across a predetermined number of divisions. This numerical distribution is different from the distribution of words across the alphabet, and that is the key to a hash table’s effectiveness.

In English, there are some letters that begin many words, while others begin fewer. You’ll find, for example, that the “M” section of the dictionary is much thicker than the “X” section. This inequity means that finding a word beginning with a very “popular” letter could take much longer than finding a word that begins with a less popular one. Hashing evens out the difference, and reduces the average time it takes to find an entry. It also separates the index from the actual entry. The hash table contains the hashed number along with a pointer to the actual data, which can be sorted in whichever way allows it to be stored most efficiently. The combination of efficient indexing and effective storage makes it possible to get results quickly, even when the user creates a complicated search.
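
Here is a toy sketch of that bucketing idea: a hash function spreads words across a fixed number of divisions, and each bucket holds (word, postings) pairs. Python's own dict already works this way internally, so the code is purely illustrative:

    NUM_BUCKETS = 8

    def bucket_for(word):
        """Map a word to one of NUM_BUCKETS divisions via Python's hash()."""
        return hash(word) % NUM_BUCKETS

    # Each bucket stores (word, postings) pairs; a real index would store
    # a pointer to the data rather than the data itself.
    buckets = [[] for _ in range(NUM_BUCKETS)]

    def insert(word, postings):
        buckets[bucket_for(word)].append((word, postings))

    def lookup(word):
        for w, postings in buckets[bucket_for(word)]:
            if w == word:
                return postings
        return None

    insert("cake", ["url1", "url2"])
    insert("xylophone", ["url3"])
    print(lookup("cake"))   # ['url1', 'url2']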

Building a Search

Searching through an index involves a user building a query and submitting it through the search engine. The query can be quite simple, a single word at minimum. Building a more complex query requires the use of Boolean operators that allow you to refine and extend the terms of the search.

The Boolean operators most often seen are:

  • AND – All the terms joined by “AND” must appear in the pages or documents. Some search engines substitute the operator “+” for the word AND.
  • OR – At least one of the terms joined by “OR” must appear in the pages or documents.
  • NOT – The term or terms following “NOT” must not appear in the pages or documents. Some search engines substitute the operator “-” for the word NOT.
  • FOLLOWED BY – One of the terms must be directly followed by the other.
  • NEAR – One of the terms must be within a specified number of words of the other.
  • Quotation Marks – The words between the quotation marks are treated as a phrase, and that phrase must be found within the document or file.
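
Given an inverted index that maps words to sets of documents, the first three operators reduce to ordinary set algebra. A minimal sketch (the tiny index is invented; FOLLOWED BY and NEAR would additionally require word positions, which this index does not store):

    # Invented inverted index: word -> set of document IDs.
    index = {
        "cake":    {1, 2, 3},
        "recipe":  {2, 3},
        "wedding": {3},
    }

    def AND(a, b):
        return index.get(a, set()) & index.get(b, set())

    def OR(a, b):
        return index.get(a, set()) | index.get(b, set())

    def NOT(a, b):
        return index.get(a, set()) - index.get(b, set())

    print(AND("cake", "recipe"))     # {2, 3}
    print(OR("recipe", "wedding"))   # {2, 3}
    print(NOT("cake", "wedding"))    # {1, 2}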

Please write comments if you find anything incorrect, or you want to share more information about the topic discussed above.
