Reputation: 467

How does a spider in a search engine work?

How does a crawler or spider in a search engine work?

Upvotes: 1

Answers (3)

alienCoder

Reputation: 1481

The World Wide Web is basically a connected directed graph of web documents, images, multimedia files, etc. Each node of the graph is a component of a web page; for example, a web page consists of images, text, video, etc., all of which are linked. A crawler traverses the graph with breadth-first search, following the links in web pages.

A crawler starts with one or more seed pages.
It scans each page and follows the links it finds there.
This process continues until the whole graph has been explored (a predefined constraint, such as a maximum search depth, can be used to limit the crawl).
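
As a rough illustration of the steps above, here is a minimal Python sketch of a breadth-first crawl with a depth limit. It is not taken from any real search engine: the `crawl` function, the `LinkExtractor` class, and the tiny in-memory `PAGES` site (which stands in for real HTTP requests) are all made up for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_depth=2):
    """Breadth-first traversal from the seed URLs; returns URLs in visit order.

    `fetch` is any callable that maps a URL to its HTML, or None on failure.
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    visited = set(seeds)
    queue = deque((url, 0) for url in seeds)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical four-page site used instead of real network access.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.test/b": '<a href="/">home</a>',
    "http://example.test/c": "",
}

print(crawl(["http://example.test/"], PAGES.get, max_depth=2))
```

The queue is what makes this breadth-first: the seed is visited first, then the pages it links to (`/a` and `/b`), and only then `/c`. The `visited` set keeps the crawler from looping when pages link back to each other.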

Upvotes: 3

Loki

Reputation: 30920

Specifically, you need at least some of the following components:

Configuration: Needed to tell the crawler how, when and where to connect to documents; and how to connect to the underlying database/indexing system.
Connector: This will create the connections to a web page or a disk share or anything, really.
Memory: The pages already visited must be known to the crawler. This is usually stored in the index, but it depends on the implementation and the needs. The content is also hashed for de-duplication and to detect updates.
Parser/Converter: Needed to be able to understand the content of a document and extract meta-data. Will convert the extracted data to a format usable by the underlying database system.
Indexer: Will push the data and meta-data to a database/indexing system.
Scheduler: Will plan runs of the crawler. Might need to handle a large number of running crawlers at the same time and take into consideration what is currently being done.
Connection algorithm: When the parser finds links to other documents, the crawler needs to decide when, how, and where the next connections are made. Also, some indexing algorithms take the page connection graph into consideration, so it might be necessary to store and sort information related to that.
Policy Management: Some sites require crawlers to respect certain policies (robots.txt, for example).
Security/User Management: The crawler might need to be able to log in to some systems to access data.
Content compilation/execution: The crawler might need to execute certain things to be able to access what's inside, like applets/plugins.
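
To make two of these components a bit more concrete (Policy Management and Memory), here is a small Python sketch. It is only one possible way to do it: the robots.txt content, the `example-crawler` user agent, the `allowed` helper, and the `PageMemory` class are invented for the example. `urllib.robotparser` and `hashlib` are from the standard library.

```python
import hashlib
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory rather than downloaded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-crawler"):
    """Policy check: may this (made-up) user agent fetch the URL?"""
    return robots.can_fetch(user_agent, url)


class PageMemory:
    """Remembers a content hash per URL for de-duplication and update checks."""

    def __init__(self):
        self.digest_by_url = {}
        self.seen_digests = set()

    def classify(self, url, body):
        """Return 'new', 'duplicate', 'unchanged', or 'updated' for a fetched page."""
        digest = hashlib.sha256(body).hexdigest()
        previous = self.digest_by_url.get(url)
        self.digest_by_url[url] = digest
        if previous == digest:
            return "unchanged"  # same URL, same content as last time
        if previous is not None:
            status = "updated"  # same URL, content has changed
        elif digest in self.seen_digests:
            status = "duplicate"  # new URL, content already seen elsewhere
        else:
            status = "new"
        self.seen_digests.add(digest)
        return status


memory = PageMemory()
print(allowed("http://example.test/index.html"))
print(allowed("http://example.test/private/data.html"))
print(memory.classify("http://example.test/a", b"hello"))
print(memory.classify("http://example.test/b", b"hello"))
```

A real crawler would keep this state in its index or database rather than in process memory, as described under Memory above.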

Crawlers need to work together efficiently from different starting points, and to be efficient in speed, memory usage, and their use of a large number of threads/processes. I/O is key.
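
Since most of a crawler's time is spent waiting on I/O, fetching many pages in parallel is the usual way to speed it up. A minimal sketch with Python's standard `concurrent.futures` module follows; `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` only sleeps to imitate network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for a real download: waits briefly, then returns a fake body."""
    time.sleep(0.05)
    return "<html>" + url + "</html>"


def fetch_all(urls, fetch, workers=8):
    """Fetch all URLs concurrently and return a {url: body} mapping."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))


urls = ["http://example.test/page%d" % i for i in range(20)]
pages = fetch_all(urls, fake_fetch)
print(len(pages))
```

Threads suit this workload because they spend their time blocked on the network rather than on the CPU; the worker count here is an arbitrary choice for the example.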

Upvotes: 3

aioobe

Reputation: 420951

From How Stuff Works

How does any spider start its travels over the Web? The usual starting points are lists of heavily used servers and very popular pages. The spider will begin with a popular site, indexing the words on its pages and following every link found within the site. In this way, the spidering system quickly begins to travel, spreading out across the most widely used portions of the Web.

Upvotes: 0
