Yacov

Reputation: 1070

Extract content from a page

I need to extract the main content from a page - something like what http://www.alchemyapi.com/api/text/ does (but I need to keep the HTML, so I can't use this API).

What logic can I use to accomplish this? (The programming language doesn't matter.)

Here is what I did (with good results) - it still needs a lot of fixes:

  1. Find the largest block of text in the page that is not split by breaking tags - ignore inline tags (span, b, etc.)
  2. Go up one level and count the breaking tags (br, p, div, etc.)
  3. Go up another level and count the tags again
  4. Compare the tag count from step 2 with the count from step 3
  5. If the counts differ by a lot, stop here - otherwise go back to step 3
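The steps above can be sketched roughly as follows. This is a minimal illustration, not a robust implementation: it assumes well-formed markup (Python's stdlib `ElementTree` cannot parse real-world HTML), the `INLINE` and `BREAKING` tag sets are incomplete examples, and the stopping threshold of 3 is an arbitrary guess you would have to tune.

```python
# Sketch of the heuristic: find the element with the most "unbroken" text,
# then climb the tree until the parent adds many more breaking tags.
# Assumes well-formed markup; tag sets and the threshold are illustrative.
import xml.etree.ElementTree as ET

INLINE = {"span", "b", "i", "em", "strong", "a", "u", "small"}
BREAKING = {"br", "p", "div", "h1", "h2", "h3", "ul", "li", "table"}

def text_len(el):
    # Step 1: length of text directly inside el, looking through inline tags only.
    total = len(el.text or "")
    for child in el:
        if child.tag in INLINE:
            total += text_len(child)
        total += len(child.tail or "")
    return total

def breaking_count(el):
    # Steps 2-3: number of breaking tags at or below el.
    return sum(1 for d in el.iter() if d.tag in BREAKING)

def main_content(root):
    parents = {c: p for p in root.iter() for c in p}
    # Start at the element holding the most text not split by breaking tags.
    node = max(root.iter(), key=text_len)
    while node in parents:
        parent = parents[node]
        # Steps 4-5: stop climbing once the parent adds many more breaking tags.
        if breaking_count(parent) > breaking_count(node) + 3:  # threshold is a guess
            break
        node = parent
    return node

html = """<html><body><div id="menu"><p>a</p><p>b</p><p>c</p><p>d</p><p>e</p></div>
<div id="article"><p>This is the long main text of the article, the block
with the most characters that are not split by breaking tags.</p></div></body></html>"""
article = main_content(ET.fromstring(html))
print(article.get("id"))
```

On this toy page the climb stops at `div#article` because its parent (`body`) also contains the menu's five extra `<p>` tags, which pushes the breaking-tag count past the threshold.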

Upvotes: 0

Views: 304

Answers (3)

Alexander.It

Reputation: 187

You need a parser to navigate the DOM; among the NuGet packages you can find some helpful parsing tools like this

Upvotes: 0

bmargulies

Reputation: 100050

Look for the Boilerpipe library. It is a comprehensive solution.

Using the Boilerpipe library, you can specify the output as HTML, so you get the main content (the article) while still preserving its HTML tags.

Upvotes: 3

Vinay

Reputation: 759

Another good alternative would be to use Goose.

It extracts more fields (published date, author, main image in the article, and a few more) than Boilerpipe does (title, content).

Upvotes: 2
