Reputation: 41
I have a directory containing over 100 HTML files. I need to extract only the contents inside the <TITLE></TITLE> and <BODY></BODY> tags and then format them as:
TITLE, "BODY CONTENT" (that is, one line per document)
It would be helpful if the results from every file could be written to one large text file. I have found the following command, which formats a document onto one line:
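# Write to a separate file: redirecting onto test.txt would truncate it before grep reads it.
grep '^[^<]' test.txt | tr -d '\n' > test_oneline.txt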
Although no specific programming language is preferred, the following would be helpful if I need to modify it further: Perl, shell (.sh), sed.
Upvotes: 1
Views: 461
Reputation: 698
Here's something in Ruby using Nokogiri.
require 'rubygems' # This line isn't needed on Ruby 1.9
require 'nokogiri'
ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  title, body = doc.title, doc.xpath('//body').inner_text
  puts %Q(#{title}, "#{body}")
end
Save that to a .rb file, for example extractor.rb. Then make sure Nokogiri is installed by running gem install nokogiri.
Use this script like so:
ruby extractor.rb /path/to/yourhtmlfiles/*.html > out.txt
Note that I don't handle newlines in this script, but you seem to have that figured out.
UPDATE:
This version strips newlines and leading/trailing whitespace.
require 'rubygems' # This line isn't needed on Ruby 1.9
require 'nokogiri'
ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  title, body = doc.title, doc.xpath('//body').inner_text.gsub("\n", '').strip
  puts %Q(#{title}, "#{body}")
end
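One thing to watch with deleting "\n" outright is that words on adjacent lines get glued together ("foo\nbar" becomes "foobar"), and carriage returns or tabs are left in place. A minimal sketch of an alternative, assuming you would rather collapse every run of whitespace into a single space; the helper names one_line and format_record are hypothetical and not part of the script above, and the sketch uses only core Ruby so it runs without Nokogiri:

```ruby
# Hypothetical helper (not part of the script above): collapse every run of
# whitespace (newlines, carriage returns, tabs, spaces) into a single space
# and trim both ends.
def one_line(text)
  text.gsub(/\s+/, ' ').strip
end

# Hypothetical helper: build the 'TITLE, "BODY CONTENT"' record for one
# document. to_s guards against a nil title (a document without <title>).
# Assumption: double quotes inside the body are left as-is, not escaped.
def format_record(title, body)
  %Q(#{one_line(title.to_s)}, "#{one_line(body.to_s)}")
end

puts format_record("My page", "First line\r\nsecond line\n\n  third line  ")
# prints: My page, "First line second line third line"
```

In the script above, this would amount to replacing the gsub("\n", '').strip call with gsub(/\s+/, ' ').strip.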
Upvotes: 2
Reputation: 21236
You could do this with C# and LINQ. A quick example of loading a file:
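// Needs: using System; using System.Collections.Generic; using System.IO; using System.Linq; using System.Xml.Linq;
// XDocument.Load only accepts well-formed XML, so this works for XHTML-style files only.
IDictionary<string, string> parsed = new Dictionary<string, string>();
foreach ( string file in Directory.GetFiles( @"your directory here" ) )
{
    var html = XDocument.Load( file ).Root;
    // <title> normally sits inside <head>, so search all descendants; compare local names
    // case-insensitively so <TITLE> and namespaced XHTML elements both match.
    string title = html.Descendants().First( e => string.Equals( e.Name.LocalName, "title", StringComparison.OrdinalIgnoreCase ) ).Value;
    string body = html.Descendants().First( e => string.Equals( e.Name.LocalName, "body", StringComparison.OrdinalIgnoreCase ) ).Value;
    // Collapse line breaks so each document fits on a single line.
    body = body.Replace( "\r", "" ).Replace( "\n", " " ).Trim();
    parsed.Add( title, body ); // throws if two files share the same title
}
using ( StreamWriter writer = new StreamWriter( @"your file path" ) )
{
    foreach ( KeyValuePair<string, string> pair in parsed )
    {
        writer.WriteLine( string.Format( "{0}, \"{1}\"", pair.Key, pair.Value ) );
    }
}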
I haven't tested this particular chunk of code, but it should work. HTH.
EDIT: If you have the base directory path, you can use Directory.GetFiles() to retrieve the file names in the directory.
Upvotes: 0