Many of us use search engines such as Google, Yahoo, Yandex and others. But do all of us understand how a search engine actually works? Although each search engine has its own ranking algorithms and its own way of presenting results, the operating principles are common to all of them.
The process of finding information on the Internet can be divided into the following steps: collecting information from websites, indexing the sites, searching on a request, and ranking the results. Let us examine each of these stages individually.
Data collection (Crawling)
Once you launch your site and let a search engine's robot know that a new resource has appeared on the Internet (through external links to your site, by adding it to the “add.url” of the search engine, or by some other method), the robot visits the site, walks through its pages and collects data from them (text content, images, videos, etc.). This process is called data collection (or crawling), and it does not happen only when a site is launched. The robot sets a schedule for its next visit to the site; on that visit it re-checks the information it already has and adds any new pages.
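The walk through a site's pages can be illustrated with a short sketch. It is only an assumption about how a robot might work, not the method of any real search engine: the pages live in a made-up in-memory dictionary (`SITE`) rather than on the network, and `crawl`, `LinkCollector` and the `fetch` callable are hypothetical names.

```python
from collections import deque
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href targets of <a> tags found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first walk from start_url; returns {url: html} for visited pages.

    `fetch` is a hypothetical callable that returns the HTML of a URL,
    or None if the page does not exist.
    """
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # broken link: nothing to collect
        pages[url] = html
        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# A made-up three-page site with one broken link.
SITE = {
    "/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "/about": '<a href="/">Home</a>',
    "/blog": '<a href="/missing">Old post</a>',
}

pages = crawl("/", SITE.get)
print(sorted(pages))  # ['/', '/about', '/blog']
```

A real robot would also fetch pages over HTTP, respect the site's crawling rules and remember when to come back; those parts are left out here.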
Indexing
The robot can walk through your website for a long time, but this does not mean that the website appears in the search results immediately. Its pages first need to pass through a process called indexing: compiling an inverted (reverse) index file for each page. The index exists to make search fast, and it usually consists of a list of the words in the text together with information about them (position in the text, weight and so on). Once a site or its individual pages have been indexed, they appear in the search engine results and can be found by the keywords present in the text. Indexing usually happens fairly soon after the robot has gathered the information from your site.
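A minimal sketch of such an index, assuming a very simple tokenizer and two made-up pages, might look like this. The names `tokenize`, `build_inverted_index` and `DOCS` are illustrative, and a real index would also store weights and other data about each word.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each word to {doc_id: [positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)


# Two made-up pages.
DOCS = {
    "page1": "Search engines crawl the web",
    "page2": "An index makes search fast",
}

index = build_inverted_index(DOCS)
print(index["search"])  # {'page1': [0], 'page2': [3]}
```

Looking up a keyword is then a single dictionary access, which is why the index makes search fast: the engine does not have to re-read every page for every request.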
Search for information
During a search, the search engine first analyses the keyword or key phrase entered by the user (preprocessing of the request) and, as a result, calculates a weight for each keyword.
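The text does not say how these weights are calculated, and each search engine has its own method. One common choice, assumed here purely for illustration, is a smoothed inverse document frequency: the rarer a word is in the collection, the more weight it gets. The function name and the statistics in `DF` are made up.

```python
import math
import re


def keyword_weights(query, document_frequency, total_documents):
    """Return {keyword: weight}, using smoothed IDF as an assumed weighting."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    weights = {}
    for word in words:
        df = document_frequency.get(word, 0)
        weights[word] = math.log((1 + total_documents) / (1 + df)) + 1
    return weights


# Made-up statistics: how many of 1000 documents contain each word.
DF = {"the": 950, "search": 120, "crawler": 8}

weights = keyword_weights("The search crawler", DF, 1000)
for word, weight in weights.items():
    print(word, round(weight, 2))
```

With these numbers the rare word "crawler" receives the largest weight and the very common word "the" the smallest.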
Next, the search is run against the inverted index, and the documents in the collection (the search database) that best match the request are found. In other words, the similarity of a document to the request can be calculated with the following formula:
similarity(Q,D) = SUM over k of (wqk*wdk)
where
similarity(Q,D) is the similarity of the request/query Q to the document D;
wqk is the weight of the k-th word in the request/query;
wdk is the weight of the k-th word in the document.
The documents that are most similar to the query make it into the search results.
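The similarity formula is a sum of products of query and document weights, which can be sketched as follows. The weights and document names are made up, and a word missing from a document is assumed to have weight 0 there.

```python
def similarity(query, document):
    """Sum of wqk * wdk over the words k of the query."""
    return sum(
        weight * document.get(word, 0.0)
        for word, weight in query.items()
    )


# Made-up weights for one query and three documents.
QUERY = {"search": 1.0, "engine": 2.0}
DOCUMENTS = {
    "doc_a": {"search": 0.5, "engine": 0.5, "index": 0.25},
    "doc_b": {"search": 1.0},
    "doc_c": {"robot": 1.0},
}

scores = {doc_id: similarity(QUERY, w) for doc_id, w in DOCUMENTS.items()}

# Keep only documents that share at least one word with the query.
results = sorted(
    (doc_id for doc_id in scores if scores[doc_id] > 0),
    key=scores.get,
    reverse=True,
)
print(scores)   # {'doc_a': 1.5, 'doc_b': 1.0, 'doc_c': 0.0}
print(results)  # ['doc_a', 'doc_b']
```

Here doc_c shares no words with the query, so its similarity is zero and it is not selected.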
Ranking of the results
After the most similar documents have been selected from the main collection, they must be ranked so that the results shown at the top are the most useful to users. A special ranking formula is used for this purpose. It takes a different form in each search engine, but the main ranking factors are the same for all of them:
- Weight of the pages (PageRank);
- Domain authority;
- Relevance of the query to the text;
- Relevance of texts of external links to the request;
- And other search ranking factors.
A simple ranking formula can be found in some articles by search engine optimizers. Its terms are defined as follows:
Ra(x) is the final relevance of the document a to the request/query x;
Ta(x) is the relevance of the text (code) of the document a to the request x;
La(x) is the relevance of the text of the links leading to the document a from other documents to the request x;
PRa is the authority index of the page a, which is constant with respect to x;
F(PRa) is a monotonically non-decreasing function with F(0)=1; we can assume that F(PRa) = (1+q*PRa);
m, p, q are some coefficients.
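The terms defined above can be combined in a short sketch. The exact combination is not shown in the text, so the code assumes one form that fits the definitions: R = (m*T + p*L) * F(PR), with F(PR) = 1 + q*PR. That form, the coefficient values and the document scores below are all assumptions made for illustration.

```python
def authority_factor(page_rank, q):
    """F(PR) = 1 + q * PR, so F(0) = 1 and F does not decrease for q >= 0."""
    return 1 + q * page_rank


def final_relevance(text_relevance, link_relevance, page_rank, m, p, q):
    """Assumed combination of the terms: R = (m*T + p*L) * F(PR)."""
    query_dependent = m * text_relevance + p * link_relevance
    return query_dependent * authority_factor(page_rank, q)


# Made-up coefficients and per-document values for one request.
M, P, Q = 1.0, 0.5, 0.25
CANDIDATES = {
    "doc_a": {"T": 0.5, "L": 0.5, "PR": 0.0},
    "doc_b": {"T": 0.5, "L": 0.5, "PR": 4.0},
    "doc_c": {"T": 1.0, "L": 0.0, "PR": 0.0},
}

relevance = {
    doc_id: final_relevance(v["T"], v["L"], v["PR"], M, P, Q)
    for doc_id, v in CANDIDATES.items()
}
ranked = sorted(relevance, key=relevance.get, reverse=True)
print(relevance)  # {'doc_a': 0.75, 'doc_b': 1.5, 'doc_c': 1.0}
print(ranked)     # ['doc_b', 'doc_c', 'doc_a']
```

In this example doc_a and doc_b have the same text and link relevance, but doc_b ranks first because of its higher page authority.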
In other words, document ranking uses both internal and external factors. These can also be divided into factors that depend on the request (the relevance of the document text or of the link text) and factors that do not. Of course, this formula gives only a very general idea of how documents are ranked in search results.