tsukugiri

Reputation: 11

Opening multiple html files & outputting to .txt with Nokogiri

I'm wondering whether these two tasks should be done with Nokogiri or with plain Ruby.

require 'open-uri'
require 'nokogiri'
require "net/http"
require "uri"

doc = Nokogiri.parse(open("example.html"))

doc.xpath("//meta[@name='author' or @name='Author']/@content").each do |metaauth|
  puts "Author: #{metaauth}"
end

doc.xpath("//meta[@name='keywords' or @name='Keywords']/@content").each do |metakey|
  puts "Keywords: #{metakey}"
end

etc...

Question 1: I'm trying to parse a directory of .html documents, get the information from the meta tags, and write the results to a text file if possible. I tried simply replacing the file name with a *.html wildcard, but that didn't seem to work (at least not with Nokogiri.parse(open()); maybe it works with ::HTML or ::XML).

Question 2: More importantly, is it possible to write all of those meta content values to a text file instead of printing them with puts?

Also forgive me if the code is overly complicated for the simple task being performed, but I'm a little new to Nokogiri / xpath / Ruby.

Thanks.

Upvotes: 0

Views: 483

Answers (2)

Phrogz

Reputation: 303168

You can output to a file like so:

File.open('results.txt','w') do |file|
  file.puts "output"   # See http://ruby-doc.org/core-2.1.2/IO.html#method-i-puts
end

Alternatively, you could do something like:

authors = doc.xpath("//meta[@name='author' or @name='Author']/@content")
keywrds = doc.xpath("//meta[@name='keywords' or @name='Keywords']/@content")
# Combine both lists before joining so every entry ends up on its own line
results = (authors.map{ |x| "Author: #{x}"   } +
           keywrds.map{ |x| "Keywords: #{x}" }).join("\n")
File.open('results.txt','w'){ |f| f << results }
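
For Question 1, the wildcard most likely failed because open() takes a single path and does not expand *.html itself; Dir.glob can do the expansion. Below is a rough sketch of that loop, not a definitive implementation. The method name write_report and its arguments are made up for illustration, and the parsing step is passed in as a block so the file handling stays separate from the Nokogiri code.

```ruby
# Sketch only: write_report, html_dir and out_path are hypothetical names.
# Expands "*.html" with Dir.glob and writes everything to one text file.
def write_report(html_dir, out_path)
  # open() takes one path, so the wildcard has to be expanded first.
  paths = Dir.glob(File.join(html_dir, "*.html")).sort

  File.open(out_path, "w") do |out|
    paths.each do |path|
      out.puts "File: #{File.basename(path)}"
      # The block receives the file's HTML and returns the lines to write,
      # e.g. the "Author: ..." strings built from doc.xpath.
      yield(File.read(path)).each { |line| out.puts line }
    end
  end

  paths.size # number of files processed
end
```

With Nokogiri loaded, the block could be something like `{ |html| Nokogiri::HTML(html).xpath("//meta[@name='author' or @name='Author']/@content").map { |a| "Author: #{a}" } }`, which reuses the XPath from the question.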

Upvotes: 0

utwang

Reputation: 1484

I have some similar code that you can use as a reference:

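require 'csv'
require 'nokogiri'

module MyParser
  # Placeholder: the directory containing your .html files.
  # (A plain string, not backticks, which would run a shell command.)
  HTML_FILE_DIR = 'your html file dir'

  def self.run(options = {})
    # Skip dot entries such as "." and ".."
    file_list = Dir.entries(HTML_FILE_DIR).reject { |f| f =~ /^\./ }

    result = file_list.map do |file|
      html = File.read("#{HTML_FILE_DIR}/#{file}")
      doc = Nokogiri::HTML(html)
      parse_to_hash(doc)
    end
    write_csv(result)
  end

  # Despite the name, this returns an array: one CSV row per document.
  def self.parse_to_hash(doc)
    array = []
    array << doc.css('your select conditions').first.content # placeholder selector
    # ... add your selector code (css or xpath)

    array
  end

  def self.write_csv(result)
    # Placeholder: your output file name
    ::CSV.open('your output file name', 'w') do |csv|
      result.each { |row| csv << row }
    end
  end
end

MyParser.run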
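
If you want the CSV to be easier to read later, a header row and a fixed column order may help. This is only a sketch using the standard csv library; write_meta_csv and the column names file, author and keywords are hypothetical, chosen to match the meta tags in the question.

```ruby
require 'csv'

# Sketch only: rows is an array of hashes such as
# { file: "a.html", author: "Ann", keywords: "x, y" }.
def write_meta_csv(rows, out_path, columns = %i[file author keywords])
  CSV.open(out_path, 'w') do |csv|
    csv << columns.map(&:to_s) # header row
    rows.each do |row|
      # Hash#[] returns nil for a missing key; CSV writes nil as an empty field.
      csv << columns.map { |col| row[col] }
    end
  end
end
```

A document without an author or keywords tag then simply gets empty cells instead of raising an error.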

Upvotes: 0
