robust

Reputation:

building a web crawler

I'm currently developing a custom search engine with a built-in web crawler. I'd rather avoid multi-threading, so my indexer has been single-threaded so far. Now I have a small dilemma with the crawler I'm building. Can anybody suggest which is better: crawl one page and index it right away, or crawl 1000+ pages into a cache and then index them?

Upvotes: 1

Views: 1342

Answers (4)

Yogi

Reputation: 2540

Not using threads is OK. However, if you still want performance, you need to use asynchronous IO. I would recommend checking out Boost.Asio. Using asynchronous IO will make your dilemma irrelevant, since the choice would no longer matter. As a bonus, if you later decide to use threads, it is trivial to tell Boost.Asio to apply multiple threads to the problem.

Upvotes: 1

Benji York

Reputation: 2050

Networks are slow (relative to the CPU). You will see a significant speed increase by parallelizing your crawler. Otherwise, your app will spend the majority of its time waiting on network IO to complete. You can either use multiple threads and blocking IO or a single thread with asynchronous IO.
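To make the single-thread, asynchronous-IO option concrete, here is a minimal sketch using Python's standard `asyncio`. The `fetch` function is a stand-in that sleeps instead of doing a real HTTP request, and the names `crawl`, `timed_crawl` and `max_in_flight` are made up for illustration; a real crawler would swap in an asynchronous HTTP client.

```python
import asyncio
import time


async def fetch(url: str, delay: float = 0.1) -> str:
    # Stand-in for a real HTTP request: the sleep simulates network latency.
    await asyncio.sleep(delay)
    return f"<html>content of {url}</html>"


async def crawl(urls, max_in_flight: int = 10) -> dict:
    # The semaphore caps how many requests are outstanding at once.
    limit = asyncio.Semaphore(max_in_flight)

    async def fetch_one(url):
        async with limit:
            return url, await fetch(url)

    # All fetches are started together and wait on the network concurrently,
    # yet everything runs on a single thread.
    pairs = await asyncio.gather(*(fetch_one(u) for u in urls))
    return dict(pairs)


def timed_crawl(urls):
    # Returns the fetched pages and the wall-clock time the crawl took.
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls))
    return pages, time.perf_counter() - start
```

With ten simulated pages at 0.1 s each, a sequential loop would need about one second, while the concurrent version should finish in roughly the time of a single request.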

Also, most indexing algorithms perform better on batches of documents than on one document at a time.
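One reason batching tends to help can be sketched with a toy inverted index: postings for the whole batch are grouped in memory first, so each term's entry in the main index is touched once per batch rather than once per document. This is only an illustration; `index_batch` and the whitespace tokenizer are assumptions, not part of any particular indexing library.

```python
from collections import defaultdict


def index_batch(index, docs):
    """Merge a batch of (doc_id, text) pairs into an inverted index."""
    # Group the postings for the whole batch in memory first.
    pending = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            pending[term].add(doc_id)
    # Then update each term's posting set in the main index only once.
    for term, doc_ids in pending.items():
        index.setdefault(term, set()).update(doc_ids)
    return index
```

If the main index lives on disk, that single update per term is where the saving would come from.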

Upvotes: 4

Matthew Flaschen

Reputation: 285077

I would strongly suggest getting "into" multi-threading if you are serious about your crawler. Basically, you would want to have at least one indexer and at least one crawler (potentially many of each) running at all times. Among other things, this minimizes start-up and shutdown overhead (e.g. initializing and freeing data structures).
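A rough sketch of that arrangement, assuming crawler threads hand pages to indexer threads through a bounded queue. The `fetch` function is a hypothetical stand-in for a blocking HTTP request, and `run`, `crawler` and `indexer` are illustrative names only.

```python
import queue
import threading


def fetch(url):
    # Hypothetical stand-in for a blocking HTTP request.
    return f"page body for {url}"


def crawler(urls, pages):
    # Producer: fetch each URL and hand the page to the indexers.
    for url in urls:
        pages.put((url, fetch(url)))


def indexer(pages, index, lock):
    # Consumer: keep indexing pages until the sentinel (None) arrives.
    while True:
        item = pages.get()
        if item is None:
            break
        url, body = item
        with lock:  # the index is shared, so updates are serialized
            for term in body.split():
                index.setdefault(term, set()).add(url)


def run(urls, n_crawlers=2, n_indexers=1):
    pages = queue.Queue(maxsize=100)  # bounded, so crawlers cannot run far ahead
    index, lock = {}, threading.Lock()
    chunks = [urls[i::n_crawlers] for i in range(n_crawlers)]
    crawlers = [threading.Thread(target=crawler, args=(c, pages)) for c in chunks]
    indexers = [threading.Thread(target=indexer, args=(pages, index, lock))
                for _ in range(n_indexers)]
    for t in crawlers + indexers:
        t.start()
    for t in crawlers:
        t.join()
    for _ in indexers:
        pages.put(None)  # one sentinel per indexer thread
    for t in indexers:
        t.join()
    return index
```

Both kinds of worker are started once and stay alive for the whole run, so their data structures are set up and torn down only once.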

Upvotes: 1

Sam Axe

Reputation: 33738

Better? In terms of what? In terms of speed, I can't foresee a noticeable difference. In terms of robustness (recovering from a catastrophic failure), it's probably better to index each page as you crawl it.
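A minimal sketch of the robustness argument, assuming progress is recorded in an append-only log file after each page is indexed, so a restart skips whatever was already finished. The log format and the names `load_done` and `crawl_and_index` are invented for this example; `fetch` and `index_page` are callables supplied by the caller.

```python
import json
import os


def load_done(path):
    # Read back the URLs that were fully indexed before the last shutdown.
    done = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                try:
                    done.add(json.loads(line)["url"])
                except (json.JSONDecodeError, KeyError):
                    continue  # skip a line truncated by a crash
    return done


def crawl_and_index(urls, path, fetch, index_page):
    done = load_done(path)
    with open(path, "a", encoding="utf-8") as log:
        for url in urls:
            if url in done:
                continue  # already indexed in an earlier run
            index_page(url, fetch(url))
            # Record the URL only after indexing succeeded.
            log.write(json.dumps({"url": url}) + "\n")
            log.flush()
            done.add(url)
```

In this scheme a crash costs at most the page that was in progress, whereas a crawl-1000-then-index run could lose the whole unindexed cache unless that cache is itself persisted.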

Upvotes: 1
