Reputation: 4009

How to extract article content from a website/blog

I'm trying to write a generic function for extracting article text from blog posts and websites.

A few simplified examples I'd like to be able to process:

Random website:

...
<div class="readAreaBox" id="readAreaBox">
  <h1 itemprop="headline">title</h1>
  <div class="chapter_update_time">time</div>
  <div class="p" id="chapterContent">article text</div>
</div>
...

WordPress:

<div id="main" class="site-main">
  <div id="primary" class="site-content" role="main">
    <div id="content" class="site-content" role="main">
      <article id="post-1234" class="post-1234 post type-post">
        <div class="entry-meta clear">..</div>
        <h1 class="entry-title">title</h1>
        <div class="entry-content clear">
          article content
          <div id="jp-post-flair" class="sharedaddy">sharing links</div>
        </div>
      </article>
    </div>
  </div>
</div>

Blogspot:

<div id="content">
  ...
  <div class="main" id="main">
    <div class="post hentry">
      <h3 class="post-title">title</h3>
      <div class="post-header">...</div>
      <div class="post-body">article content</div>
      <div class="post-footer">...</div>
    </div>
  </div>
</div>

What I came up with (doc is a Nokogiri::HTML::Document; present? comes from ActiveSupport):

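def fetch_content
  selectors = ['#content', '#main', 'article', '.post-body', '.entry-content', '#chapterContent']
  # Render each selector's matches to HTML and drop the selectors that matched nothing.
  candidates = selectors.map { |css| doc.css(css).to_html }.select(&:present?)
  # Keep the shortest non-blank candidate; fall back to '' when nothing matched.
  self.content = candidates.min_by(&:length).to_s
end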
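
To make the selection rule concrete, here is a minimal sketch of it in isolation. The plain strings are stand-ins for what doc.css(css).to_html might return on the WordPress example, and shortest_fragment is a hypothetical helper name used only for this illustration:

```ruby
# Hypothetical helper: returns the shortest non-blank fragment,
# or an empty string when every fragment is nil or blank.
def shortest_fragment(fragments)
  fragments
    .reject { |fragment| fragment.nil? || fragment.strip.empty? }
    .min_by(&:length)
    .to_s
end

# Stand-ins (assumed, simplified) for doc.css(css).to_html per selector.
fragments = [
  '<div id="content"><article><div class="entry-content">article content</div></article></div>', # '#content'
  '',                                                 # '.post-body' matched nothing
  '<div class="entry-content">article content</div>'  # '.entry-content'
]

puts shortest_fragment(fragments)
```

The shortest match wins because the outer wrappers contain everything the inner ones do plus extra markup, which is also why anything nested inside the winning element (such as the sharing links) survives.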

It works reasonably well on the examples I tested, but it still leaves some sharing and navigation links in the output, and it won't work if a page uses more cryptic class names.

Is there a better way to do this?

Upvotes: 3