AtulBha

Reputation: 41

Extracting content from html tags

I have a directory containing over 100 html files. I need to extract only the contents inside <TITLE></TITLE> and <BODY></BODY> tags and then format them as:

TITLE, "BODY CONTENT" (That is one line per document)

It would be helpful if the results from every file in the directory could be written to one large text file. I found the following command, which flattens a document to a single line:

grep '^[^<]' test.txt | tr -d '\n' > test_oneline.txt

No specific programming language is required, but one of the following would be helpful in case I need to modify the solution later: Perl, shell (.sh), or sed.

Upvotes: 1

Views: 461

Answers (2)

Justin Workman

Reputation: 698

Here's something in Ruby using Nokogiri.

require 'rubygems' # This line isn't needed on Ruby 1.9
require 'nokogiri'

ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  title, body = doc.title, doc.xpath('//body').inner_text
  puts %Q(#{title}, "#{body}")
end

Save that to a .rb file, for example extractor.rb. Then you need to make sure Nokogiri is installed by running gem install nokogiri.

Use this script like so:

ruby extractor.rb /path/to/yourhtmlfiles/*.html > out.txt

Note that I don't handle newlines in this script, but you seem to have that figured out.

UPDATE:

This time it strips newlines and beginning/ending spaces.

require 'rubygems' # This line isn't needed on Ruby 1.9
require 'nokogiri'

ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  title, body = doc.title, doc.xpath('//body').inner_text.gsub("\n", '').strip
  puts %Q(#{title}, "#{body}")
end
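
One caveat with gsub("\n", '') is that it joins the last word of one line to the first word of the next, and a double quote inside the body would end the quoted field early. A minimal sketch of one way to handle both, assuming you want whitespace collapsed to single spaces and embedded quotes doubled (the usual CSV convention). The helper name format_line is hypothetical and not part of the script above:

```ruby
# Hypothetical helper: builds one 'TITLE, "BODY"' output line.
def format_line(title, body)
  # Collapse every run of whitespace (newlines, tabs, spaces) into one space.
  # to_s guards against a nil title when a document has no <title>.
  clean_title = title.to_s.gsub(/\s+/, ' ').strip
  clean_body  = body.to_s.gsub(/\s+/, ' ').strip

  # Double any embedded quotes so the quoted field stays unambiguous.
  escaped_body = clean_body.gsub('"', '""')

  %Q(#{clean_title}, "#{escaped_body}")
end

puts format_line("My Page\n", "First line\nSecond \"quoted\" line\n")
# prints: My Page, "First line Second ""quoted"" line"
```

In the loop above you would then call puts format_line(title, body) instead of building the string inline.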

Upvotes: 2

Tieson T.

Reputation: 21236

You could do this with C# and LINQ to XML. A quick example of loading the files:

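    // Needs: using System.Collections.Generic; using System.IO; using System.Xml.Linq;
    // XDocument.Load only accepts well-formed XML, so ordinary (non-XHTML) HTML will throw.
    IDictionary<string, string> parsed = new Dictionary<string, string>();

    foreach ( string path in Directory.GetFiles( @"your directory here" ) )
    {
        // Load the file currently being iterated over.
        var html = XDocument.Load( path ).Element( "html" );

        // <title> is a child of <head>, not a direct child of <html>.
        string title = html.Element( "head" ).Element( "title" ).Value;

        // Value is the concatenated text of <body>; flatten it onto one line.
        string body = html.Element( "body" ).Value;
        body = body.Replace( "\r", "" ).Replace( "\n", " " ).Trim();

        parsed.Add( title, body ); // throws if two files share the same title
    }

    using ( StreamWriter writer = new StreamWriter( @"your file path" ) )
    {
        foreach ( KeyValuePair<string, string> pair in parsed )
        {
            writer.WriteLine( string.Format( "{0}, \"{1}\"", pair.Key, pair.Value ) );
        }
    }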

I haven't tested this particular chunk of code, but it should work. HTH.

EDIT: If you have the base directory path, you can use Directory.GetFiles() to retrieve the file names in the directory.

Upvotes: 0
