Fan Phill

Reputation: 143

How to discover common information blocks across multiple web pages of the same website?

This is a pattern recognition task for a web crawler. A traditional crawler fetches the data of the whole page. Is there any way to make the crawler a little more intelligent, so that it identifies and captures only the information part of each page?

Upvotes: 0

Views: 60

Answers (1)

Nikita Astrakhantsev

Reputation: 4749

This is a research problem known as wrapper induction or web data extraction. I don't know of any library for it, but there are a lot of research papers (see below the list of good ones, IMHO) and some research projects such as DIADEM (their site contains a list of publications as well).
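To make the idea concrete, here is a minimal sketch of a much simpler heuristic than real wrapper induction: text that sits at the same tag path with the same wording on most pages of a site is probably template (menus, footers), and whatever is left is probably the information part. The names `BlockCollector`, `extract_blocks`, `split_template_and_content` and the `threshold` value are my own assumptions for illustration, not taken from any paper or from DIADEM, and it uses only the Python standard library.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class BlockCollector(HTMLParser):
    """Collects (tag path, text) pairs, e.g. ("html/body/h1", "Red kettle")."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate sloppy markup: unwind the stack down to the matching tag.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text:
            self.blocks.append(("/".join(self.stack), text))


def extract_blocks(html):
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


def split_template_and_content(pages, threshold=0.8):
    """Return, for each page, the texts that are NOT shared by most pages.

    A (path, text) pair seen on at least `threshold` of the pages is
    treated as template; the 0.8 default is an arbitrary assumption.
    """
    per_page = [extract_blocks(page) for page in pages]
    seen = Counter()
    for blocks in per_page:
        seen.update(set(blocks))  # count each block once per page
    cutoff = threshold * len(pages)
    return [
        [text for path, text in blocks if seen[(path, text)] < cutoff]
        for blocks in per_page
    ]
```

Feed it the raw HTML of several pages crawled from the same site and it returns one list of "content" texts per page. It will not cope with templates that vary slightly between pages or with content that happens to repeat, which is where the research approaches come in.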

Upvotes: 1
