leon

Reputation: 1

How do I loop my ruby / nokogiri parser through several local HTML files and output the results to one CSV file?

I just wrote my first Ruby program, a simple parser. I plan to parse a set of 200 or so local .htm files with Ruby and Nokogiri and output everything to a single .csv file.

The local files are organized like this:

root\region_name1\city_name1.htm
root\region_name1\city_name2.htm
root\region_name1\city_name3.htm
root\region_name2\city_name1.htm
...

The relevant html source within above .htm files looks like this:

<div class="media-body">
    <h4 class="list-group-item-heading"><a ng-href="#/clubs/2001103" class="ng-binding" href="http://www.vereinssuche-nrw.de/#/clubs/2001103">DJK Arminia Eilendorf 1919 e. V.</a> <small ng-show="item.distance > 0" class="ng-binding" style="display: none;">0 km</small></h4>
        <div class="row">
            <div class="col-12 col-lg-6 ng-binding">
                <span ng-show="item.geoadresse.strasse" class="ng-binding">Ulmenstraße 12<br></span>52080 Aachen<br>
                <a ng-href="tel:0241 551424" ng-show="item.telefon" class="ng-binding" href="unsafe:tel:0241 551424">Tel.: 0241 551424<br></a>
                <a ng-href="http://www.DJK-Arminia-Eilendorf.de" ng-show="item.webseite" target="_blank" class="ng-binding" href="http://www.djk-arminia-eilendorf.de/">http://www.DJK-Arminia-Eilendorf.de</a>
            </div>
                <div class="col-lg-6 col-12 visible-lg event-list">
                    <b>Veranstaltungen</b>
                    <!-- ngRepeat: event in item.veranstaltungen | limitTo:3 -->
                <div ng-show="item.veranstaltungen.length == 0" class="text-muted">Keine Veranstaltungen angekündigt.</div>
            </div>
        </div>
</div>

My Ruby code works fine for a single .htm file and extracts the data I need via XPath. Instead of parsing every file separately and merging the 200 output.csv files by hand, I would like to automate the whole process, but I cannot figure out how to do this.

Here is my ruby code:

require 'rubygems'
require 'nokogiri'
require 'csv'

# define arrays including a dummy array which is needed for reasons i do not yet know :P
# remember that you can easily adapt this parser to suit your needs by defining additional variables
# and by adding additional xpath steps (doc.xpath...) below
name = Array.new
strasse = Array.new
plzort = Array.new
tel = Array.new
website = Array.new
dummy = Array.new

doc = Nokogiri::HTML(open("aachen.htm"))
puts doc.class   # => Nokogiri::HTML::Document

# search elements via xpath and collect contents in arrays
name = doc.xpath("//div/h4/a").collect {|node| node.text.strip}
strasse = doc.xpath("//div/span[contains(@ng-show,'item.geoadresse.strasse')]").collect {|node| node.text.strip}
plzort = doc.xpath("//div[@id='searchResults']/div/div/div/div/div[1]/text()").collect {|node| node.text.strip}
tel = doc.xpath("//div/a[contains(@ng-show,'item.telefon')]").collect {|node| node.text.strip}
website = doc.xpath("//div/a[contains(@ng-show,'item.webseite')]").collect {|node| node.text.strip}
dummy = doc.xpath("//*[@id='searchResults']/div[39]/div/div/div/div[1]/br").collect {|node| node.text.strip}
plzort.delete("")

# generate CSV file output.csv and force UTF-8
CSV.open("output.csv", "wb:UTF-8") do |csv|
    # prepopulate CSV file with column headings
    csv << ["name", "strasse", "plzort", "tel", "website", "dummy"]
    # repeat extraction process until name array returns nothing i.e. no more elements on page
    until name.empty?
        # write everything to CSV file
        csv << [name.shift, strasse.shift, plzort.shift, tel.shift, website.shift, dummy.shift]
    end
end

I have read through the ruby and nokogiri documentation but alas, I have no idea how to proceed.

Upvotes: 0

Views: 1147

Answers (2)

the Tin Man

Reputation: 160551

Here's how I'd write sections of your code:

name = Array.new
strasse = Array.new
plzort = Array.new
tel = Array.new
website = Array.new
dummy = Array.new

can be written more concisely as:

name = []
strasse = []
plzort = []
tel = []
website = []
dummy = []

But, it's not necessary to initialize the variables in Ruby. Instead, assign directly to them...

name = doc.xpath("//div/h4/a").collect {|node| node.text.strip}
strasse = doc.xpath("//div/span[contains(@ng-show,'item.geoadresse.strasse')]").collect {|node| node.text.strip}
plzort = doc.xpath("//div[@id='searchResults']/div/div/div/div/div[1]/text()").collect {|node| node.text.strip}
tel = doc.xpath("//div/a[contains(@ng-show,'item.telefon')]").collect {|node| node.text.strip}
website = doc.xpath("//div/a[contains(@ng-show,'item.webseite')]").collect {|node| node.text.strip}
dummy = doc.xpath("//*[@id='searchResults']/div[39]/div/div/div/div[1]/br").collect {|node| node.text.strip}

would do that, but it repeats the same call six times. Instead, use something like this:

name, strasse, plzort, tel, website, dummy = [
  "//div/h4/a",
  "//div/span[contains(@ng-show,'item.geoadresse.strasse')]",
  "//div[@id='searchResults']/div/div/div/div/div[1]/text()",
  "//div/a[contains(@ng-show,'item.telefon')]",
  "//div/a[contains(@ng-show,'item.webseite')]",
  "//*[@id='searchResults']/div[39]/div/div/div/div[1]/br"
].map { |s|
  doc.xpath(s).collect {|node| node.text.strip}
}

Your XPaths become data in an array that you iterate over, performing the same operation each time. That makes the code easier to understand and maintain.
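Taking the same idea one step further, the field names can live next to their XPaths, so that the CSV header and the rows come from one table. The following is only a sketch under assumptions: club_fields and rows_for are hypothetical helpers, not part of Nokogiri or the original script, the dummy column is left out, and lookup stands for whatever turns an XPath into an array of strings.

```ruby
# Field name => XPath, copied from the question. The keys double as the CSV header.
def club_fields
  {
    "name"    => "//div/h4/a",
    "strasse" => "//div/span[contains(@ng-show,'item.geoadresse.strasse')]",
    "plzort"  => "//div[@id='searchResults']/div/div/div/div/div[1]/text()",
    "tel"     => "//div/a[contains(@ng-show,'item.telefon')]",
    "website" => "//div/a[contains(@ng-show,'item.webseite')]"
  }
end

# lookup is any callable that maps an XPath to an array of strings.
# With Nokogiri it could be: ->(xp) { doc.xpath(xp).map { |n| n.text.strip } }
def rows_for(fields, lookup)
  columns = fields.values.map { |xpath| lookup.call(xpath) }
  longest = columns.map(&:size).max || 0
  # Pad shorter columns with nil so that transpose gets a rectangular table.
  columns.map { |col| col + [nil] * (longest - col.size) }.transpose
end
```

The header row is then club_fields.keys. Note that padding only keeps the rows rectangular. If some club has no phone number, the values after it still shift up by one row, exactly as they do with the shift loop in the question.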

plzort.delete("")

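works here only because collect returns a plain Array of strings, so delete("") removes the empty entries. If plzort were still a NodeSet, that is, the raw result of doc.xpath without the collect step, the call would fail, because a NodeSet doesn't know how to delete(""):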

plzort = doc.xpath('//bar')
plzort.delete("") # => 
# ~> -:9:in `delete': node must be a Nokogiri::XML::Node or Nokogiri::XML::Namespace (ArgumentError)
# ~>  from -:9:in `<main>'
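For comparison, here is a minimal sketch of the Array case, assuming the values have already been collected into plain strings as the collect calls in the question do. Array#delete removes every matching element in place. drop_blank is a hypothetical helper, not part of Nokogiri or the original script, that returns a cleaned copy and also drops whitespace-only entries. The sample values are made up for illustration.

```ruby
# Sample values standing in for the collected text nodes (illustrative only).
plzort = ["", "52080 Aachen", "", "52062 Aachen"]
plzort.delete("")   # Array#delete removes every "" from the array in place

# Hypothetical helper: non-mutating, also drops whitespace-only strings.
def drop_blank(values)
  values.reject { |v| v.strip.empty? }
end
```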

Upvotes: 1

ply

Reputation: 1141

Probably the easiest approach is to move all of the files into a single directory. Then you can loop through the entries in that one directory using Dir.foreach and change your current script a bit so that it appends its results to the output file.

Assuming your script works now for one file, once you have a loop moving through all of the files in a directory, replace the hardcoded filename with the iterator variable and change the mode on your output file from "wb" (write) to "ab" (append). Keep in mind that Dir.foreach yields bare entry names, including "." and "..", so skip anything that is not an .htm file and prefix each name with the directory.

require 'nokogiri'
require 'csv'

dir = 'root\region_name1'
# write the column headings only if output.csv does not exist yet
write_header = !File.exist?("output.csv")

Dir.foreach(dir) do |file|
  # Dir.foreach yields bare entry names, including "." and "..", so keep only .htm files
  next unless file.end_with?(".htm")

  # use the iterator variable instead of a hardcoded filename, prefixed with the directory
  doc = File.open(File.join(dir, file)) { |f| Nokogiri::HTML(f) }

  # search elements via xpath and collect contents in arrays
  name = doc.xpath("//div/h4/a").collect {|node| node.text.strip}
  strasse = doc.xpath("//div/span[contains(@ng-show,'item.geoadresse.strasse')]").collect {|node| node.text.strip}
  plzort = doc.xpath("//div[@id='searchResults']/div/div/div/div/div[1]/text()").collect {|node| node.text.strip}
  tel = doc.xpath("//div/a[contains(@ng-show,'item.telefon')]").collect {|node| node.text.strip}
  website = doc.xpath("//div/a[contains(@ng-show,'item.webseite')]").collect {|node| node.text.strip}
  dummy = doc.xpath("//*[@id='searchResults']/div[39]/div/div/div/div[1]/br").collect {|node| node.text.strip}
  plzort.delete("")

  # open output.csv in append mode ("ab") so that earlier results are not overwritten
  CSV.open("output.csv", "ab:UTF-8") do |csv|
    if write_header
      csv << ["name", "strasse", "plzort", "tel", "website", "dummy"]
      write_header = false
    end
    # repeat extraction process until name array returns nothing i.e. no more elements on page
    until name.empty?
      csv << [name.shift, strasse.shift, plzort.shift, tel.shift, website.shift, dummy.shift]
    end
  end
end

If you have a lot of directories and can't move all of your .htm files into one place, the same logic applies, but you first loop through the parent directory and then through the .htm files in each subdirectory:

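Dir.foreach("parent_directory") do |folder|
    next if folder.start_with?(".")   # skip "." and ".."
    region = File.join("parent_directory", folder)
    next unless File.directory?(region)
    Dir.foreach(region) do |file|
        next unless file.end_with?(".htm")
        # insert script here, opening File.join(region, file)
    end
end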

The Dir class and the FileUtils module are very useful for looping through files and folders.
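As an alternative to nested Dir.foreach loops, Dir.glob can match root/<region>/<city>.htm in one pattern, and opening the CSV once in "wb" mode means the header is written exactly once. This is a rough sketch under assumptions: export_all is a hypothetical helper, not part of the original script. Forward slashes are used in paths (Ruby accepts them on Windows too), the dummy column is dropped, and the per-file parsing is passed in as a block that returns an array of rows.

```ruby
require 'csv'

# Hypothetical helper: walks root/<region>/<city>.htm, yields each path to the
# block, and writes every row the block returns to a single CSV file.
# Returns the number of .htm files processed.
def export_all(root, out_path, header: %w[name strasse plzort tel website])
  files = Dir.glob(File.join(root, "*", "*.htm")).sort
  CSV.open(out_path, "wb:UTF-8") do |csv|
    csv << header                     # written once, before any data rows
    files.each do |path|
      yield(path).each { |row| csv << row }
    end
  end
  files.size
end

# Possible usage with Nokogiri (commented out so the sketch runs without the gem):
#
#   export_all("root", "output.csv") do |path|
#     doc = File.open(path) { |f| Nokogiri::HTML(f) }
#     names = doc.xpath("//div/h4/a").map { |n| n.text.strip }
#     names.map { |n| [n, nil, nil, nil, nil] }
#   end
```

Keeping the file walking separate from the parsing also makes it easy to try the loop on a handful of files before running it over all 200.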

Upvotes: 0
