Extracting content from html tags

Question

I have a directory containing over 100 HTML files. I need to extract only the contents inside the <title> and <body> tags and then format them as:

TITLE, "BODY CONTENT" (That is one line per document)

It would be beneficial if the results from every file could be written to one large text file. I have found the following command to format a document as one line:

grep '^[^<]' test.txt | tr -d '\n' > test.txt

Although no specific programming language is preferred, the following would be helpful if I need to modify it further: Perl, shell (.sh), sed

Justin Workman · Accepted Answer

Here's something in Ruby using Nokogiri.

require 'rubygems' # This line isn't needed on Ruby 1.9
require 'nokogiri'

ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  title, body = doc.title, doc.xpath('//body').inner_text
  puts %Q(#{title}, "#{body}")
end

Save that to a .rb file, for example extractor.rb. Then you need to make sure Nokogiri is installed by running gem install nokogiri.

Use this script like so:

ruby extractor.rb /path/to/yourhtmlfiles/*.html > out.txt

Note that I don't handle newlines in this script, but you seem to have that figured out.

UPDATE:

This time it strips newlines and beginning/ending spaces.

require 'rubygems' # This line isn't needed on Ruby 1.9
require 'nokogiri'

ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  title, body = doc.title, doc.xpath('//body').inner_text.gsub("\n", '').strip
  puts %Q(#{title}, "#{body}")
end
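
Deleting newlines outright joins the last word of one line to the first word of the next, and a double quote inside the body would make the quoted field ambiguous. As a rough sketch only (format_line is a hypothetical helper, not part of the answer above, and doubling embedded quotes CSV-style is an assumption about the desired output), the formatting step could be separated from the parsing like this:

```ruby
# Hypothetical helper (not part of the original answer): builds one
# output line of the form  TITLE, "BODY CONTENT"  from two strings.
def format_line(title, body)
  # Collapse every run of whitespace (newlines, tabs, repeated spaces)
  # into a single space, then trim both ends. to_s guards against nil,
  # e.g. when a document has no title.
  clean_title = title.to_s.gsub(/\s+/, ' ').strip
  clean_body  = body.to_s.gsub(/\s+/, ' ').strip

  # Assumption: embedded double quotes are doubled, CSV-style, so the
  # quoted body stays unambiguous.
  escaped_body = clean_body.gsub('"', '""')

  %Q(#{clean_title}, "#{escaped_body}")
end

puts format_line("My Page\n", "  First line\nSecond \"quoted\" line\n\n")
```

In the script above, the puts line would then become puts format_line(doc.title, doc.xpath('//body').inner_text).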
