Reputation: 4009
I'm trying to write a generic function for extracting article text from blog posts and websites.
A few simplified examples I'd like to be able to process:
Random website:
...
<div class="readAreaBox" id="readAreaBox">
<h1 itemprop="headline">title</h1>
<div class="chapter_update_time">time</div>
<div class="p" id="chapterContent">article text</div>
</div>
...
Wordpress:
<div id="main" class="site-main">
<div id="primary" class="site-content" role="main">
<div id="content" class="site-content" role="main">
<article id="post-1234" class="post-1234 post type-post">
<div class="entry-meta clear">..</div>
<h1 class="entry-title">title</h1>
<div class="entry-content clear">
article content
<div id="jp-post-flair" class="sharedaddy">sharing links</div>
</div>
</article>
</div>
</div>
</div>
Blogspot:
<div id="content">
...
<div class="main" id="main">
<div class="post hentry">
<h3 class="post-title">title</h3>
<div class="post-header">...</div>
<div class="post-body">article content</div>
<div class="post-footer">...</div>
</div>
</div>
</div>
What I came up with (doc is a Nokogiri::HTML::Document):
def fetch_content
  html = ''
  ['#content', '#main', 'article', '.post-body', '.entry-content', '#chapterContent'].each do |css|
    candidate = doc.css(css).to_html
    html = [html, candidate].select(&:present?).sort_by(&:length).first
  end
  self.content = html
end
It works relatively well for the examples I tested with, but it still leaves some sharing and navigation links, and it won't work if a page uses more cryptic class names.
Is there a better way to do this?
Upvotes: 3
Views: 2987
Reputation: 197
You could also use a free article extraction API, for example:
diffbot.com
embed.ly
textracto.com
Some of them work quite well, and as far as I know they are all easy to integrate with Ruby.
Upvotes: 0
Reputation: 1892
Use rapar. It lets you write domain-specific parsers for sites such as wordpress.com, blogspot.com, etc.
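To illustrate the domain-specific idea (this is plain Ruby, not rapar's API, which I'm not showing here), a rough sketch could map hostnames to the selectors from the question and fall back to generic ones. The names SITE_SELECTORS, FALLBACK_SELECTORS and selectors_for are made up for this example:

```ruby
require 'uri'

# Hypothetical per-site table; the selectors are taken from the
# WordPress and Blogspot examples in the question.
SITE_SELECTORS = {
  /\.wordpress\.com\z/ => '.entry-content',
  /\.blogspot\.com\z/  => '.post-body'
}.freeze

# Generic selectors to try when no site-specific rule matches.
FALLBACK_SELECTORS = ['article', '#content', '#main'].freeze

# Returns the CSS selectors to try for a URL, most specific first.
def selectors_for(url)
  host = URI.parse(url).host.to_s.downcase
  match = SITE_SELECTORS.find { |pattern, _selector| host =~ pattern }
  match ? [match.last] + FALLBACK_SELECTORS : FALLBACK_SELECTORS
end
```

With a Nokogiri document you could then try each selector in order and stop at the first hit, for example with doc.at_css(css) inside a loop over selectors_for(url).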
Upvotes: 1
Reputation: 429
There is a gem called pismo that implements a couple of algorithms that attempt to extract article content.
There is also a Java library, boilerpipe, which extracts the textual content of a web page and which you can call from JRuby.
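For a feel of how this kind of extractor can tell content from navigation without relying on class names, here is a rough sketch of a shallow-text heuristic (word count plus link density), which is the general idea boilerpipe is built around. It is not pismo's or boilerpipe's actual code or API; the Block struct, the method names and the thresholds are all assumptions for illustration:

```ruby
# Hypothetical block representation: the visible text of an element
# plus the part of that text that sits inside <a> tags.
Block = Struct.new(:text, :link_text)

def word_count(str)
  str.to_s.split.size
end

# Share of a block's words that are link text (0.0 for an empty block).
def link_density(block)
  total = word_count(block.text)
  return 0.0 if total.zero?

  word_count(block.link_text).to_f / total
end

# Keep blocks that are long enough and not dominated by links.
# The thresholds are illustrative guesses, not values from any library.
def content_blocks(blocks, min_words: 10, max_link_density: 0.3)
  blocks.select do |block|
    word_count(block.text) >= min_words && link_density(block) <= max_link_density
  end
end
```

Sharing widgets and navigation bars tend to be short and almost entirely link text, so they would be filtered out by a rule like this, while a long article body with only a few links would be kept.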
Upvotes: 1