How do search engines work?


Many of us use search engines such as Google, Yahoo, Yandex and others. But do all of us understand how a search engine actually works? Although each search engine has its own peculiarities in its ranking algorithms and in how it presents results, the operating principles are common to all of them.

If we consider the process of searching for information on the Internet, we can divide it into the following steps: collecting information from websites, indexing the sites, searching for a query and ranking the results. Let us examine each of these stages individually.

[Image: how Google works]

Data collection (Crawling)

As soon as you launch your site and let the robot of a search engine know that a new resource has appeared on the Internet (via external links to your site, by adding it through the search engine's "add URL" page, or by some other method), the robot visits you, starts walking through the pages and collects data from them (text content, images, videos, etc.). This process is called data collection (or crawling), and it does not happen only at the launch of a site. The robot draws up a schedule that defines when it will come back to check the site: it will re-check the old information and add new pages if there are any.

It is important that the communication between the bot and your website is pleasant for both parties. It is in your interest that the bot does not spend too long on your website (so that it does not overload the server), while at the same time you need it to collect data from all of the relevant pages. Making crawling fast is also in the robot's interest, because it has a huge number of websites to check. To speed this process up, make sure that your website is available and has no navigation problems (robots do not handle menus built with Flash or JavaScript very well), that there are no broken links on your website (or that they return a 404 error), and that bots are not forced to crawl pages that are available only to registered users. Keep in mind that web spiders have limits: they are restricted by crawl depth (the level of the pages) and by the maximum size of the scanned text (usually 256 KB). You can control the crawler's access with the robots.txt file, and a sitemap.xml can also help the bot index your website correctly.
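
As an illustration, here is a minimal sketch in Python (using the standard urllib.robotparser module) of how a crawler can consult robots.txt before fetching a page; the bot name and URLs are hypothetical examples.

from urllib import robotparser

# Download and parse the site's robots.txt (hypothetical example URL).
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# A well-behaved crawler fetches a page only if robots.txt allows it.
if rp.can_fetch("MyBot", "https://www.example.com/members-only/profile.html"):
    print("Allowed to crawl this page")
else:
    print("Disallowed by robots.txt, skipping")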

Indexing

The robot can walk through your website for a long time, but that does not mean your website will immediately appear in the search results. The pages of your website first have to pass through a process called indexing: building a reverse (inverted) index for each page. The index makes searching fast and usually consists of a list of words from the text together with information about them (position in the text, weight and so on). Once a site or its individual pages have been indexed, they appear in the search results and can be found by the keywords present in the text. Indexing usually happens fairly quickly after the robot has collected the information from your site.
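
To make the idea of an inverted index more concrete, here is a minimal sketch in Python; the two sample documents are invented for illustration, and real indexes store far more information per word (weights, fields and so on).

from collections import defaultdict

documents = {
    1: "search engines crawl and index web pages",
    2: "an inverted index maps words to pages",
}

# For every word, store the documents and positions where it occurs.
inverted_index = defaultdict(list)
for doc_id, text in documents.items():
    for position, word in enumerate(text.lower().split()):
        inverted_index[word].append((doc_id, position))

# A keyword lookup now returns matching documents without scanning the texts.
print(inverted_index["index"])   # [(1, 4), (2, 2)]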

Search for information

When a search is performed, the search engine first analyses the keyword or keyword phrase entered by the user (preprocessing of the query), and as a result calculates a weight for each keyword.

Next, the search is performed against the inverted index: all documents in the collection (the search database) that best match the given query are found. In other words, the similarity between a document and the query can be calculated by the following formula:

similarity(Q,D) = SUM(wqk * wdk)

similarity(Q,D) is the similarity of the query Q to the document D;

wqk is the weight of word k in the query;

wdk is the weight of word k in the document.

The documents that are most similar to the query make it into the search results.
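
For example, here is a minimal sketch in Python of the similarity formula above; the query and document weights are invented toy values.

def similarity(query_weights, doc_weights):
    # similarity(Q, D) = SUM(wqk * wdk) over the words shared by Q and D
    return sum(
        weight * doc_weights.get(word, 0.0)
        for word, weight in query_weights.items()
    )

query = {"search": 0.7, "engine": 0.3}
document = {"search": 0.5, "engine": 0.4, "index": 0.2}

print(similarity(query, document))  # 0.7*0.5 + 0.3*0.4 = 0.47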

Ranking

After the most similar documents have been selected from the collection, they have to be ranked so that the results shown at the top are the most useful for users. For this purpose a special ranking formula is used; it differs from one search engine to another, but for all of them the main ranking factors are:

  • Weight of the page (PageRank);
  • Domain authority;
  • Relevance of the page text to the query;
  • Relevance of the anchor text of external links to the query;
  • And other ranking factors.

There is a simple ranking formula that can be found in some articles by search engine optimizers:

Ra(x) = (m*Ta(x) + p*La(x)) * F(PRx)

Ra(x) is the final relevance of the query a to the document x;

Ta(x) is the relevance of the text (code) of the document x to the query a;

La(x) is the relevance of the anchor text of links from other documents pointing to the document x, to the query a;

PRx is the authority index of the page x; it does not depend on the query a and is a constant for a given x;

F(PRx) is a monotonically non-decreasing function with F(0) = 1; for example, we can take F(PRx) = (1 + q*PRx);

m, p, q are some coefficients.

In other words, both internal and external factors are used when ranking documents. They can be divided into query-dependent factors (the relevance of the document text or of the link text) and query-independent factors. Of course, this formula gives only a very general idea of how documents are ranked in the search results.
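
As a rough illustration, here is a minimal sketch in Python of the ranking formula above, taking F(PRx) = 1 + q*PRx; the coefficients and scores are invented toy values.

def rank(text_relevance, link_relevance, pagerank, m=1.0, p=0.5, q=0.2):
    # Ra(x) = (m*Ta(x) + p*La(x)) * F(PRx), with F(PR) = 1 + q*PR
    return (m * text_relevance + p * link_relevance) * (1 + q * pagerank)

# Two candidate documents: the second has weaker text relevance but stronger
# link relevance and higher authority, so it ends up ranked higher.
print(rank(text_relevance=0.8, link_relevance=0.2, pagerank=1.0))  # 1.08
print(rank(text_relevance=0.6, link_relevance=0.9, pagerank=4.0))  # 1.89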
