Andrew

Reputation: 6394

Any ideas on how to identify the main content of the page?

If you had to identify the main text of a page (e.g. the post's content on a blog page), what would you do? What do you think is the simplest way to do it?

  1. Get the page content with cURL
  2. Maybe use a DOM parser to identify the elements of the page (a rough sketch of these two steps follows)
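
A rough sketch of those two steps in Ruby, with open-uri standing in for cURL and Nokogiri as the DOM parser (the URL is only a placeholder):

require 'open-uri'
require 'nokogiri'

html = URI.open('https://example.com/some-post').read   # step 1: fetch the page
doc  = Nokogiri::HTML(html)                              # step 2: parse the DOM

# Naive first pass: dump the text of every <p> element.
puts doc.css('p').map(&:text).join("\n")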

Upvotes: 9

Views: 4718

Answers (7)

Gregory Ostermayr

Reputation: 1121

I've ported the original boilerpipe Java code to a pure Ruby implementation, Ruby Boilerpipe. There is also a JRuby version wrapping the original Java code, Jruby Boilerpipe.

Upvotes: 0

David L-R

Reputation: 197

Recently I faced the same problem. I developed a news article scraper and had to detect the main textual content of the article pages. Many news sites display lots of other textual content beside the "main article" (e.g. 'read next', 'you might be interested in'). My first approach was to collect all text between <p> tags. But this didn't work, because some news sites used <p> for other elements like navigation, 'read more', etc. too. Some time ago I stumbled on the Boilerpipe library.

The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.

That sounded like the perfect solution for my problem, but it wasn't. It failed at many news sites, because it was often not able to parse the whole text of the news article. I don't know why, but I think the boilerpipe algorithm can't deal with badly written HTML. So in many cases it just returned an empty string instead of the main content of the news article.

After this bad experience I tried to develop my own "article text extractor" algorithm. The main idea was to split the html into different depths, for example:

<html>
<!-- depth: 1 -->
<nav>
  <!-- depth: 2 -->
  <ul>
    <!-- depth: 3 -->
    <li><a href="/mhh">Site<!-- depth: 5 --></a></li>
    <li><a href="/bla">Site<!-- depth: 5 --></a></li>
  </ul>
</nav>
<div id='text'>
  <!-- depth: 2 -->
  <p>That's the main content...<!-- depth: 3 --></p>
  <p>main content, bla, bla bla ... <!-- depth: 3 --></p>
  <p>bla bla bla interesting bla bla! <!-- depth: 3 --></p>
  <p>whatever, bla... <!-- depth: 3 --></p>
</div>

</html>

As you can see, for this algorithm to filter out the surplus "clutter", things like navigation elements, "you may like" sections, etc. must sit at a different depth than the main content. Or in other words: the surplus "clutter" must be wrapped in more (or fewer) HTML tags than the main textual content.

  1. Calculate the depth of every html element.
  2. Find the depth with the highest amount of textual content.
  3. Select all textual content with this depth.

To prove this concept I wrote a Ruby script, which works well with most news sites. In addition to the Ruby script I also developed the textracto.com API, which you can use for free.
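
A minimal sketch of the depth idea, assuming the Nokogiri gem is available (this is only an illustration, not the production script behind textracto.com):

require 'nokogiri'

def main_text_by_depth(html)
  doc = Nokogiri::HTML(html)

  # Collect every non-empty text node, grouped by its depth in the DOM tree.
  text_by_depth = Hash.new { |h, k| h[k] = [] }
  doc.traverse do |node|
    next unless node.text?
    content = node.text.strip
    next if content.empty?
    text_by_depth[node.ancestors.length] << content
  end

  # The depth carrying the most characters is assumed to hold the main content.
  _depth, chunks = text_by_depth.max_by { |_, texts| texts.map(&:length).sum }
  chunks ? chunks.join("\n") : ''
end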

Greetings, David

Upvotes: 1

David J.

Reputation: 32715

You might consider:

  • Boilerpipe: "The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page. The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings."
  • Ruby Readability: "Ruby Readability is a tool for extracting the primary readable content of a webpage. It is a Ruby port of arc90's readability project." (A usage sketch follows this list.)
  • The Readability API: "If you'd like access to the Readability parser directly, the Content API is available upon request. Contact us if you're interested."
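
For example, a minimal use of Ruby Readability might look like the following (assuming the ruby-readability gem; check its README for the exact gem and require names):

require 'readability'

# `source` is the raw HTML of the page, fetched however you like.
document = Readability::Document.new(source)
puts document.content   # the extracted main content as HTML
puts document.title     # the detected title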

Upvotes: 2

Rune

Reputation: 8380

It depends very much on the page. Do you know anything about the page's structure beforehand? If you are in luck, it might provide an RSS feed that you could use or it might be marked up with some of the new HTML5 tags like <article>, <section> etc. (which carry more semantic power than pre-HTML5 tags).
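
If you want to test for those hints programmatically, here is a minimal sketch, assuming the page source is already in an html string and the Nokogiri gem is available:

require 'nokogiri'

doc = Nokogiri::HTML(html)

# Check whether the page advertises an RSS feed in its <head>.
feed = doc.at_css('link[rel="alternate"][type="application/rss+xml"]')
puts "Feed found: #{feed['href']}" if feed

# Fall back to the HTML5 semantic elements if present.
article = doc.at_css('article, section')
puts article.text.strip if article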

Upvotes: 0

yura

Reputation: 14645

There are some frameworks that can achieve this; one of them is http://code.google.com/p/boilerpipe/, which uses some statistics. Some features that can detect the HTML block with the main content (a rough scoring sketch follows the list):

  1. p, div tags
  2. amount of text inside/outside
  3. amount of links inside/outside (i.e. to remove menus)
  4. some CSS class names and ids (frequently those blocks have classes or ids like main, main_block, content, etc.)
  5. relation between title and text inside content
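
Here is a rough scoring sketch along those lines (illustrative only, not how boilerpipe itself works), assuming the Nokogiri gem and a page source in an html string:

require 'nokogiri'

doc = Nokogiri::HTML(html)

# Score each candidate block: lots of text, few links, hint from class/id names.
best = doc.css('div, article, section').max_by do |block|
  text_len = block.text.strip.length
  link_len = block.css('a').map { |a| a.text.strip.length }.sum

  score  = text_len - 2 * link_len                       # link-heavy blocks are likely menus
  score += 50 * block.css('p').length                    # paragraphs hint at article text
  hint   = "#{block['id']} #{block['class']}"
  score += 200 if hint =~ /content|main|article|post/i   # common naming conventions
  score
end

puts best.text.strip if best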

Upvotes: 7

user336063

Reputation:

That's a pretty hard task, but I would start by counting spaces inside of DOM elements. A telltale sign of human-readable content is spaces and periods. Most articles seem to encapsulate the content in paragraph tags, so you could look at all <p> tags with n spaces and at least one punctuation mark.

You could also use the number of grouped paragraph tags inside an element. So if a div has N paragraph children, it could very well be the content you're wanting to extract.
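
A minimal sketch of that heuristic, assuming the Nokogiri gem and a page source in an html string (the threshold of 20 spaces is an arbitrary illustration):

require 'nokogiri'

doc = Nokogiri::HTML(html)

# Keep only <p> tags that look like prose: enough spaces and some punctuation.
prose = doc.css('p').select do |p|
  p.text.count(' ') >= 20 && p.text =~ /[.!?]/
end

# The parent holding the most prose paragraphs is the best candidate container.
_path, best = prose.group_by { |p| p.parent.path }.max_by { |_, ps| ps.length }
puts best.first.parent.text.strip if best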

Upvotes: 12

Tieson T.

Reputation: 21201

It seems like the best answer is "it depends". As in, it depends on how the site in question is marked up.

  1. If the author uses "common" tags, you could look for a container element ID'd as "content" or "main."
  2. If the author is using HTML5, you should in theory be able to query for the <article> element, if it's a page with only one "story" to tell (a minimal sketch of both checks follows this list).
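
A minimal sketch of both lookups, assuming Nokogiri and a page source in an html string (the selectors are just common conventions, not guaranteed to match every site):

require 'nokogiri'

doc = Nokogiri::HTML(html)

# Prefer an HTML5 <article>, then fall back to common container ids/classes.
main = doc.at_css('article') ||
       doc.at_css('#content, #main, .content, .post')
puts main.text.strip if main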

Upvotes: 1
