Reputation: 6394
If you had to identify the main text of a page (e.g. on a blog page, identifying the post's content), what would you do? What do you think is the simplest way to do it?
Upvotes: 9
Views: 4718
Reputation: 1121
I've ported the original boilerpipe Java code into a pure Ruby implementation, Ruby Boilerpipe, and there is also a JRuby version wrapping the original Java code, Jruby Boilerpipe.
Upvotes: 0
Reputation: 197
Recently I faced the same problem. I developed a news article scraper and had to detect the main textual content of the article pages. Many news sites display lots of other textual content besides the "main article" (e.g. 'read next', 'you might be interested in'). My first approach was to collect all text between <p> tags. But this didn't work, because some news sites also used <p> for other elements like navigation, 'read more', etc. Some time ago I stumbled upon the Boilerpipe library.
The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.
That sounded like the perfect solution for my problem, but it wasn't. It failed at many news sites because it was often not able to parse the whole text of the article. I don't know why, but I think the boilerpipe algorithm can't deal with badly written HTML. In many cases it just returned an empty string instead of the main content of the news article.
After this bad experience I tried to develop my own "article text extractor" algorithm. The main idea was to split the HTML into different depths, for example:
<html>
<!-- depth: 1 -->
<nav>
<!-- depth: 2 -->
<ul>
<!-- depth: 3 -->
<li><a href="/mhh">Site<!-- depth: 5 --></a></li>
<li><a href="/bla">Site<!--- depth: 5 ---></a></li>
</ul>
</nav>
<div id='text'>
<!-- depth: 2 -->
<p>That's the main content...<!-- depth: 3 --></p>
<p>main content, bla, bla bla ... <!-- depth: 3 --></p>
<p>bla bla bla interesting bla bla! <!-- depth: 3 --></p>
<p>whatever, bla... <!-- depth: 3 --></p>
</div>
</html>
As you can see, to filter out the surplus "clutter" with this algorithm, things like navigation elements, "you may like" sections, etc. must sit at a different depth than the main content. Or in other words: the surplus "clutter" must be wrapped in more (or fewer) HTML tags than the main textual content.
To prove this concept I wrote a Ruby script, which works well with most news sites. In addition to the Ruby script I also developed the textracto.com API, which you can use for free.
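A minimal sketch of this depth idea (not the actual script behind textracto; it assumes the Nokogiri gem and a placeholder 'article.html' input file) could look roughly like this:

# Group every text node by its nesting depth and assume the depth
# holding the most text is the main content (illustrative sketch only).
require 'nokogiri'

def main_text_by_depth(html)
  doc = Nokogiri::HTML(html)
  buckets = Hash.new { |h, k| h[k] = [] }

  doc.traverse do |node|
    next unless node.text?                                       # only look at text nodes
    text = node.text.strip
    next if text.empty? || %w[script style].include?(node.parent.name)
    buckets[node.ancestors.length] << text                       # key: nesting depth
  end

  # Pick the depth level that accumulated the largest amount of text.
  _depth, texts = buckets.max_by { |_, t| t.join.length }
  texts ? texts.join("\n") : ''
end

puts main_text_by_depth(File.read('article.html'))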
Greetings, David
Upvotes: 1
Reputation: 32715
You might consider:
Upvotes: 2
Reputation: 8380
It depends very much on the page. Do you know anything about the page's structure beforehand? If you are in luck, it might provide an RSS feed that you could use, or it might be marked up with some of the new HTML5 tags like <article>, <section> etc. (which carry more semantic power than pre-HTML5 tags).
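In that lucky case the check is almost a one-liner. A minimal sketch (assuming the Nokogiri gem; 'article.html' is just a placeholder input file):

# Sketch for the lucky case: the page uses HTML5 semantic markup.
require 'nokogiri'

doc  = Nokogiri::HTML(File.read('article.html'))
main = doc.at_css('article') || doc.at_css('section')  # first <article>, else first <section>
puts main ? main.text.strip : 'No semantic main element found'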
Upvotes: 0
Reputation: 14645
There are some frameworks that can achieve this; one of them is http://code.google.com/p/boilerpipe/, which uses some statistics. Some features that can detect the HTML block with the main content:
Upvotes: 7
Reputation:
That's a pretty hard task, but I would start by counting spaces inside of DOM elements. A tell-tale sign of human-readable content is spaces and periods. Most articles seem to encapsulate the content in paragraph tags, so you could look at all <p> tags with n spaces and at least one punctuation mark.
You could also use the number of grouped paragraph tags inside an element. So if a div has N paragraph children, it could very well be the content you're wanting to extract.
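A rough sketch of those two heuristics combined (assuming the Nokogiri gem; the space threshold and 'page.html' input are arbitrary placeholders):

# Keep <p> tags that look like prose (enough spaces plus punctuation),
# then pick the parent element holding the most of them.
require 'nokogiri'

MIN_SPACES = 10  # arbitrary: real sentences contain plenty of spaces

doc = Nokogiri::HTML(File.read('page.html'))

prose = doc.css('p').select do |p|
  p.text.count(' ') >= MIN_SPACES && p.text =~ /[.!?]/
end

# Group the prose-like paragraphs by their parent (keyed by its XPath)
# and take the parent with the most such children as the likely container.
best = prose.group_by { |p| p.parent.path }.max_by { |_, ps| ps.size }
puts best ? best.last.map { |p| p.text.strip }.join("\n") : 'No content found'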
Upvotes: 12
Reputation: 21201
It seems like the best answer is "it depends". As in, it depends on how the site in question is marked up.
For example, the <article> element, if it's a page with only one "story" to tell.
Upvotes: 1