wassimans

Reputation: 8682

Search Engine without crawling?

Is there a way to collect web content for use in a search engine without going through a web crawling phase? Is there any alternative to web crawling?

Thanks

Upvotes: 3

Views: 1190

Answers (5)

Calmarius

Reputation: 19451

Well, if you don't want to crawl, you can follow a wiki-like approach in which users submit links to sites (with a title, description, and tags). That way a collaborative link collection can be built.

To avoid spam, a +/- voting system can be used to vote useful sites or tags up and useless ones down.

To stop spammers from mass-voting on SERPs, you can weight votes by user reputation.

User reputation can be earned by submitting useful sites, or inferred somehow by tracking usage patterns.

Other abuse patterns would need to be considered too.
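A minimal sketch of the reputation-weighted voting idea, assuming each vote is +1 or -1 and reputation is a plain number per user. The names `Vote`, `weighted_score`, and `default_weight` are made up for illustration, and the reputation values are invented:

```python
from dataclasses import dataclass


@dataclass
class Vote:
    user: str
    value: int  # +1 for an upvote, -1 for a downvote


def weighted_score(votes, reputation, default_weight=1.0):
    """Sum the votes, scaling each one by the voter's reputation."""
    total = 0.0
    for vote in votes:
        # Users with no recorded reputation fall back to default_weight,
        # so a batch of fresh throwaway accounts carries little weight.
        total += vote.value * reputation.get(vote.user, default_weight)
    return total


# Hypothetical data: two established users and three new accounts.
reputation = {"alice": 50.0, "bob": 20.0}
votes = [
    Vote("alice", +1),
    Vote("bob", +1),
    Vote("spam1", -1),
    Vote("spam2", -1),
    Vote("spam3", -1),
]
print(weighted_score(votes, reputation))  # 67.0
```

With unweighted votes the same link would end up at -1; with weighting, three new accounts cannot outvote two trusted users. How reputation is earned is a separate problem that this sketch does not address.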

Well, you get the point, I think.

As spammers gradually discover the weaknesses of traditional search engines (see Google bombs, content scraper sites, etc.), a community-based approach may work. But it would suffer severely from the cold start effect, and when the community is small the system is easy to abuse and poison...

At least Wikipedia and Stack Exchange have not been spammed to useless levels so far...

PS: http://xkcd.com/810/

Upvotes: 0

Varun Pathak

Reputation: 53

If you want to stay up to date with the latest content on pages, you can use something like the PubSubHubbub protocol to get push notifications for the links you subscribe to. Or use a paid service like Superfeedr that makes use of the same protocol.
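As a rough illustration of what a subscription looks like, the sketch below only builds the form-encoded body of a PubSubHubbub subscription request; it does not send anything. The `hub.mode`, `hub.topic`, `hub.callback`, and `hub.lease_seconds` parameters come from the protocol, while the function name and both URLs are hypothetical:

```python
from urllib.parse import urlencode


def build_subscription_request(topic_url, callback_url, lease_seconds=None):
    """Return the form-encoded body of a PubSubHubbub subscription request."""
    params = {
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed you want updates for
        "hub.callback": callback_url,  # your endpoint that receives pushes
    }
    if lease_seconds is not None:
        # Optional: how long you would like the subscription to last.
        params["hub.lease_seconds"] = str(lease_seconds)
    return urlencode(params)


# Hypothetical URLs, for illustration only.
body = build_subscription_request(
    "https://example.com/feed.xml",
    "https://my-search.example/push",
    lease_seconds=86400,
)
print(body)
```

You would POST this body to the feed's hub with `Content-Type: application/x-www-form-urlencoded`. The hub then verifies the subscription by sending a GET request to the callback with a `hub.challenge` value that the callback has to echo back, so a publicly reachable endpoint is needed.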

Upvotes: 2

mt3

Reputation: 2784

Yes (and sort-of no).

:)

You can download existing data dumps from various websites (Wikipedia, Stack Overflow, etc.) and construct a partial index that way. It obviously won't be a complete index of the internet.
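To make the "partial index" idea concrete, here is a toy inverted index, assuming the dump has already been parsed into a dictionary of document ID to plain text. The documents and the helper names (`tokenize`, `build_index`, `search`) are invented for the example; real dumps need their own parsing step:

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Stand-in for text extracted from a data dump.
documents = {
    "doc1": "Web crawlers fetch pages by following links.",
    "doc2": "Data dumps let you index content without fetching pages.",
    "doc3": "Search engines rank pages for a query.",
}
index = build_index(documents)
print(sorted(search(index, "index pages")))  # ['doc2']
```

This only does boolean AND matching with no ranking, which is enough to show that an index can be built without fetching a single page yourself.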

You could also use meta-search to construct your search engine. This is where you use the APIs of other search engines and use THEIR search results as the basis of your index. Examples include citosearch and opensearch. DuckDuckGo uses Yahoo's BOSS API (and now Yahoo uses Bing...) as part of its search engine.

There are also real-time streaming APIs that you could use instead of crawling the web. Look at DataSift as an example. There are many more resources you could cleverly use to avoid or minimize crawling.
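One part of meta-search that can be sketched without touching any real API is merging the ranked lists that come back from different engines. The example below uses reciprocal rank fusion, a common way to combine rankings; it is not claimed to be what any of the engines mentioned above actually do, and the result lists are hard-coded placeholders rather than real API responses:

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked URL lists; each appearance adds 1 / (k + rank) to a URL's score."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Placeholder results standing in for two different search engine APIs.
engine_a = ["https://a.example", "https://b.example", "https://c.example"]
engine_b = ["https://b.example", "https://d.example", "https://a.example"]

merged = reciprocal_rank_fusion([engine_a, engine_b])
print(merged)
```

URLs returned by both engines rise to the top, so here `https://b.example` comes first and `https://a.example` second. The constant `k=60` is a conventional default and can be tuned.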

Upvotes: 3

T.J. Crowder

Reputation: 1075437

No, to collect the content you have to...collect the content. :-)

Upvotes: 5

Upul Bandara

Reputation: 5958

Directly or indirectly, you have to crawl the web in order to get the content.

Upvotes: 1
