By Author – Prankul Sinha
INTRODUCTION:
A web search engine is software designed to search for information on the World Wide Web. The results are generally presented as a list, often referred to as search engine results pages (SERPs), and can be a mixture of web pages, images, videos, and other content. Websites with good SEO (search engine optimization) are more likely to appear on the first page of Google's results, which is why SEO is very important for websites.
Before we study how a search engine works and how it finds web pages, we need to cover a few technical concepts:
- Web crawler
- Robots.txt
- Meta tags
- Web crawler:
A Web crawler, sometimes called a spider, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.
Web search engines and some other sites use Web crawling or spidering software to update their own web content or their indices of other sites' web content. Web crawlers copy pages for processing by a search engine, which indexes the downloaded pages so that users can search more efficiently.
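The copy-and-follow-links behaviour described above can be sketched as a breadth-first traversal. This is only a minimal illustration, not how any particular search engine implements crawling: the `PAGES` dictionary is a hypothetical in-memory stand-in for the web (a real crawler would fetch pages over HTTP), and the names `LinkExtractor` and `crawl` are made up for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML. A real crawler would use HTTP.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a> <a href="/missing">gone</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed):
    """Breadth-first crawl from seed; returns {url: html} for copied pages."""
    seen = {seed}
    queue = deque([seed])
    copied = {}
    while queue:
        url = queue.popleft()
        html = PAGES.get(url)  # stand-in for an HTTP GET
        if html is None:
            continue  # page could not be retrieved, so skip it
        copied[url] = html  # the copy handed to the indexer
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return copied


for url in crawl("http://example.com/"):
    print(url)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as the home page and `/b` do here.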
- Robots.txt:
The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform a web robot about which areas of the website should not be processed or scanned. Robots are often used by search engines to categorize websites. Not all robots cooperate with the standard; email harvesters, malware, and robots that scan for security vulnerabilities may even start with the portions of the website that they have been told to stay out of. The standard is different from, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites.
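As a rough illustration of how a cooperating crawler might consult these rules, the sketch below uses `urllib.robotparser` from the Python standard library. The robots.txt content, the agent names `MyCrawler` and `BadBot`, and the helper names `build_parser` and `allowed` are all hypothetical; a real crawler would download the file from the site's `/robots.txt` path instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, inlined so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: BadBot
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)


def allowed(agent, url):
    """True if the rules permit `agent` to fetch `url`."""
    return rules.can_fetch(agent, url)


print(allowed("MyCrawler", "http://example.com/index.html"))
print(allowed("MyCrawler", "http://example.com/private/data.html"))
print(allowed("BadBot", "http://example.com/index.html"))
```

Note that the check is purely advisory: nothing stops a non-cooperating robot from skipping it, which is exactly the limitation the paragraph above describes.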
- Meta tags:
Meta elements are tags used in HTML and XHTML documents to provide structured metadata about a Web page. They are part of a web page’s head section. Multiple Meta elements with different attributes can be used on the same page. Meta elements can be used to specify page description, keywords and any other metadata not provided through the other head elements and attributes.
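To make this concrete, here is a small sketch of how a program could read the named meta elements out of a page's head section, using the standard library's `html.parser`. The sample page, the `MetaExtractor` class, and the `extract_meta` helper are invented for illustration; real pages carry many other kinds of meta elements (for example `charset` or `http-equiv`) that this sketch simply ignores.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects <meta name="..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Skip meta elements without a name, such as <meta charset="...">.
        if name and content is not None:
            self.meta[name.lower()] = content


# Hypothetical sample page.
SAMPLE_PAGE = """<html><head>
<title>Example</title>
<meta charset="utf-8">
<meta name="description" content="A short page summary.">
<meta name="keywords" content="search, crawler, index">
</head><body><p>Hello</p></body></html>"""


def extract_meta(html):
    """Return the named meta elements of an HTML document."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.meta


for name, content in extract_meta(SAMPLE_PAGE).items():
    print(f"{name}: {content}")
```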
Approach:
A search engine maintains the following processes in near real-time:
- Web crawling
- Indexing
- Searching
Web search engines get their information by crawling from site to site. The “spider” first checks for the standard file robots.txt and for any rules addressed to it. It then sends information back to be indexed, depending on many factors such as titles, page content, JavaScript, Cascading Style Sheets (CSS), and headings, as evidenced by the standard HTML markup of the content or by its metadata in HTML meta tags. No web crawler can actually crawl the entire reachable web. Because of the effectively endless number of pages, spider traps, spam, and other exigencies of the real web, crawlers instead apply a crawl policy to determine when the crawling of a site should be deemed sufficient. Some sites are crawled exhaustively, while others are crawled only partially.
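One simple way to picture a crawl policy is as a pair of limits: a page budget and a maximum link depth. The sketch below is an assumption about what such a policy could look like, not a description of any real engine's rules. The `links_of` function is a hypothetical link source that simulates a spider trap (a calendar whose pages link to a "next" page forever), and `max_pages` and `max_depth` are illustrative parameters.

```python
from collections import deque


def links_of(url):
    """Hypothetical link source containing an endless 'calendar' trap."""
    if url == "http://site.test/":
        return ["http://site.test/about", "http://site.test/cal/1"]
    if url.startswith("http://site.test/cal/"):
        n = int(url.rsplit("/", 1)[1])
        return [f"http://site.test/cal/{n + 1}"]  # always one more page
    return []


def crawl_with_policy(seed, max_pages=5, max_depth=3):
    """Breadth-first crawl that stops at a page budget or a depth limit."""
    visited = []
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # policy: do not follow links any deeper
        for link in links_of(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


for url in crawl_with_policy("http://site.test/"):
    print(url)
```

Without either limit, the crawl of this site would never end, because every calendar page produces a new unseen link. Either limit on its own is enough to make the crawl terminate.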
Indexing means associating words and other definable tokens found on web pages with their domain names and HTML-based fields. The associations are stored in a public database that is made available for web search queries. A query from a user can be as short as a single word, and the index helps find information relating to the query as quickly as possible. Some of the techniques for indexing and caching are trade secrets, whereas web crawling is a comparatively straightforward process of visiting sites on a systematic basis.
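The association between tokens and pages is commonly illustrated with an inverted index, which maps each token to the set of pages containing it. The sketch below is a simplified assumption: it maps tokens to whole URLs and ignores HTML fields and ranking, and the sample pages and the names `tokenize`, `build_index`, and `search` are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: token -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in tokenize(text):
            index[token].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical page texts, as a crawler might have copied them.
PAGES = {
    "http://example.com/crawl": "Web crawlers copy pages for indexing",
    "http://example.com/index": "Indexing maps words to pages",
    "http://example.com/robots": "Robots.txt tells crawlers what to skip",
}

index = build_index(PAGES)
print(sorted(search(index, "crawlers")))
print(sorted(search(index, "crawlers pages")))
```

Because the lookup goes from a token straight to its set of pages, answering a query does not require rescanning any page text, which is what makes the index fast.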
Between visits by the spider, the cached version of the page (some or all of the content needed to render it), stored in the search engine's working memory, is quickly sent to an inquirer. If a visit is overdue, the search engine can simply act as a web proxy instead. In this case, the page may differ from the search terms indexed. The cached page holds the appearance of the version whose words were indexed, so a cached version of a page can be useful when the actual page has been lost, although this problem is also considered a mild form of link rot.
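The choice between serving the cached copy and acting as a proxy can be sketched as a freshness check. Everything here is an assumption made for illustration: the `Snapshot` type, the `serve` function, the `revisit_interval` threshold, and the `fetch_live` callback are hypothetical, and real engines decide when a visit is "overdue" in ways the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """A cached copy of a page and the time the spider last visited it."""
    content: str
    crawled_at: float  # seconds, on the same clock as `now`


def serve(url, cache, fetch_live, now, revisit_interval=3600.0):
    """Serve the cached copy if it is fresh, otherwise proxy the live page."""
    snapshot = cache.get(url)
    if snapshot is not None and now - snapshot.crawled_at <= revisit_interval:
        return "cache", snapshot.content
    # Visit overdue (or page never crawled): fall back to the live page,
    # which may no longer match the words that were indexed.
    return "proxy", fetch_live(url)


cache = {"http://example.com/": Snapshot("indexed version", crawled_at=0.0)}


def fetch_live(url):
    """Hypothetical stand-in for fetching the current page."""
    return "current version"


print(serve("http://example.com/", cache, fetch_live, now=100.0))
print(serve("http://example.com/", cache, fetch_live, now=5000.0))
```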