Reputation: 1
I just wrote my first Ruby program, a simple parser. I plan to parse a set of 200 or so local .htm files with Ruby and Nokogiri and output everything to a single .csv file.
The local files are organized like this:
root\region_name1\city_name1.htm
root\region_name1\city_name2.htm
root\region_name1\city_name3.htm
root\region_name2\city_name1.htm
...
The relevant html source within above .htm files looks like this:
<div class="media-body">
<h4 class="list-group-item-heading"><a ng-href="#/clubs/2001103" class="ng-binding" href="http://www.vereinssuche-nrw.de/#/clubs/2001103">DJK Arminia Eilendorf 1919 e. V.</a> <small ng-show="item.distance > 0" class="ng-binding" style="display: none;">0 km</small></h4>
<div class="row">
<div class="col-12 col-lg-6 ng-binding">
<span ng-show="item.geoadresse.strasse" class="ng-binding">Ulmenstraße 12<br></span>52080 Aachen<br>
<a ng-href="tel:0241 551424" ng-show="item.telefon" class="ng-binding" href="unsafe:tel:0241 551424">Tel.: 0241 551424<br></a>
<a ng-href="http://www.DJK-Arminia-Eilendorf.de" ng-show="item.webseite" target="_blank" class="ng-binding" href="http://www.djk-arminia-eilendorf.de/">http://www.DJK-Arminia-Eilendorf.de</a>
</div>
<div class="col-lg-6 col-12 visible-lg event-list">
<b>Veranstaltungen</b>
<!-- ngRepeat: event in item.veranstaltungen | limitTo:3 -->
<div ng-show="item.veranstaltungen.length == 0" class="text-muted">Keine Veranstaltungen angekündigt.</div>
<div>
</div>
</div>
My Ruby code works fine for a single .htm file and extracts the data I need via XPath. Instead of parsing each of the 200 .htm files separately and merging the resulting output.csv files by hand, I would like to automate the whole process, but I cannot figure out how to do this.
Here is my ruby code:
require 'rubygems'
require 'nokogiri'
require 'csv'
# define arrays including a dummy array which is needed for reasons i do not yet know :P
# remember that you can easily adapt this parser to suit your needs by defining additional variables
# and by adding additional xpath steps (doc.xpath...) below
name = Array.new
strasse = Array.new
plzort = Array.new
tel = Array.new
website = Array.new
dummy = Array.new
doc = Nokogiri::HTML(open("aachen.htm"))
puts doc.class # => Nokogiri::HTML::Document
# search elements via xpath and collect contents in arrays
name = doc.xpath("//div/h4/a").collect {|node| node.text.strip}
strasse = doc.xpath("//div/span[contains(@ng-show,'item.geoadresse.strasse')]").collect {|node| node.text.strip}
plzort = doc.xpath("//div[@id='searchResults']/div/div/div/div/div[1]/text()").collect {|node| node.text.strip}
tel = doc.xpath("//div/a[contains(@ng-show,'item.telefon')]").collect {|node| node.text.strip}
website = doc.xpath("//div/a[contains(@ng-show,'item.webseite')]").collect {|node| node.text.strip}
dummy = doc.xpath("//*[@id='searchResults']/div[39]/div/div/div/div[1]/br").collect {|node| node.text.strip}
plzort.delete("")
# generate CSV file output.csv and force UTF-8
CSV.open("output.csv", "wb:UTF-8") do |csv|
# prepopulate CSV file with column headings
csv << ["name", "strasse", "plzort", "tel", "website", "dummy"]
# repeat extraction process until name array returns nothing i.e. no more elements on page
until name.empty?
# write everything to CSV file
csv << [name.shift, strasse.shift, plzort.shift, tel.shift, website.shift, dummy.shift]
end
end
I have read through the Ruby and Nokogiri documentation but, alas, I have no idea how to proceed.
Upvotes: 0
Views: 1147
Reputation: 160551
Here's how I'd write sections of your code:
name = Array.new
strasse = Array.new
plzort = Array.new
tel = Array.new
website = Array.new
dummy = Array.new
can be written more clearly like:
name = []
strasse = []
plzort = []
tel = []
website = []
dummy = []
But, it's not necessary to initialize the variables in Ruby. Instead, assign directly to them...
name = doc.xpath("//div/h4/a").collect {|node| node.text.strip}
strasse = doc.xpath("//div/span[contains(@ng-show,'item.geoadresse.strasse')]").collect {|node| node.text.strip}
plzort = doc.xpath("//div[@id='searchResults']/div/div/div/div/div[1]/text()").collect {|node| node.text.strip}
tel = doc.xpath("//div/a[contains(@ng-show,'item.telefon')]").collect {|node| node.text.strip}
website = doc.xpath("//div/a[contains(@ng-show,'item.webseite')]").collect {|node| node.text.strip}
dummy = doc.xpath("//*[@id='searchResults']/div[39]/div/div/div/div[1]/br").collect {|node| node.text.strip}
would do that, but it repeats the same collect block six times. Instead, use something like this:
name, strasse, plzort, tel, website, dummy = [
  "//div/h4/a",
  "//div/span[contains(@ng-show,'item.geoadresse.strasse')]",
  "//div[@id='searchResults']/div/div/div/div/div[1]/text()",
  "//div/a[contains(@ng-show,'item.telefon')]",
  "//div/a[contains(@ng-show,'item.webseite')]",
  "//*[@id='searchResults']/div[39]/div/div/div/div[1]/br"
].map { |s|
  # run each XPath and collect the stripped text of every matching node
  doc.xpath(s).collect { |node| node.text.strip }
}
Your XPaths become data in an array that you iterate over, performing the same operation each time. That makes the code easier to understand and maintain.
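plzort.delete("")
works here only because collect returns a plain Array of strings, and Array#delete removes every matching element. If plzort were assigned the raw NodeSet instead, delete("") would fail, because NodeSet#delete expects a node: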
plzort = doc.xpath('//bar')
plzort.delete("") # =>
# ~> -:9:in `delete': node must be a Nokogiri::XML::Node or Nokogiri::XML::Namespace (ArgumentError)
# ~> from -:9:in `<main>'
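For comparison, here is a small sketch of what Array#delete does on an Array of strings, which is what collect returns. The values are made up and stand in for the stripped text nodes.

```ruby
# Made-up values standing in for the stripped text nodes collected into plzort.
plzort = ["", "52080 Aachen", "", "52062 Aachen", ""]

# Array#delete mutates the array and removes every element equal to "".
# It returns the deleted value, or nil when nothing matched.
removed = plzort.delete("")

# Non-mutating alternative: reject returns a new array and leaves the
# original untouched.
cleaned = ["", "52080 Aachen"].reject(&:empty?)

puts plzort.inspect
puts cleaned.inspect
```

Either form only shortens the one array it is called on, so the other arrays keep their original length.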
Upvotes: 1
Reputation: 1141
Probably the easiest approach would be to move all of the files into a single directory. Then you could loop through the entries in that one directory using Dir.foreach, and change your current script a bit to append results to the output file.
Assuming your script works now for one file, once you have a loop moving through all of the files in a directory, replace the hardcoded filename with the iterator variable, and change the mode on your output file from "wb" (write) to "ab" (append):
dir = 'root/region_name1' # forward slashes also work on Windows
write_header = !File.exist?("output.csv") # write the column headings only once
Dir.foreach(dir) do |file|
  next unless file.end_with?(".htm") # Dir.foreach also yields "." and ".."
  # Dir.foreach yields bare file names, so prepend the directory instead of hardcoding a filename
  doc = File.open(File.join(dir, file)) { |f| Nokogiri::HTML(f) }
  # search elements via xpath and collect contents in arrays
  name = doc.xpath("//div/h4/a").collect {|node| node.text.strip}
  strasse = doc.xpath("//div/span[contains(@ng-show,'item.geoadresse.strasse')]").collect {|node| node.text.strip}
  plzort = doc.xpath("//div[@id='searchResults']/div/div/div/div/div[1]/text()").collect {|node| node.text.strip}
  tel = doc.xpath("//div/a[contains(@ng-show,'item.telefon')]").collect {|node| node.text.strip}
  website = doc.xpath("//div/a[contains(@ng-show,'item.webseite')]").collect {|node| node.text.strip}
  dummy = doc.xpath("//*[@id='searchResults']/div[39]/div/div/div/div[1]/br").collect {|node| node.text.strip}
  plzort.delete("")
  # open output.csv in "ab" mode to append instead of overwrite, and force UTF-8
  CSV.open("output.csv", "ab:UTF-8") do |csv|
    # prepopulate CSV file with column headings, for the first file only
    if write_header
      csv << ["name", "strasse", "plzort", "tel", "website", "dummy"]
      write_header = false
    end
    # repeat until name array is empty, i.e. no more elements on page
    until name.empty?
      csv << [name.shift, strasse.shift, plzort.shift, tel.shift, website.shift, dummy.shift]
    end
  end
end
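An alternative to append mode is to open the output file once, outside the loop, so the header is written a single time and "wb" can stay as it is. The following is a minimal sketch of that arrangement. The helper write_combined_csv and its rows_for parameter are hypothetical names, and the lambda stands in for the Nokogiri extraction.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical helper: writes one header row, then the rows of every file.
# rows_for is any callable that returns an array of rows for a given path.
def write_combined_csv(paths, out_path, header, rows_for)
  CSV.open(out_path, "wb:UTF-8") do |csv|
    csv << header
    paths.each do |path|
      rows_for.call(path).each { |row| csv << row }
    end
  end
end

# Demonstration with a stand-in extractor instead of the real parser.
Dir.mktmpdir do |dir|
  out = File.join(dir, "output.csv")
  fake_rows = ->(path) { [[File.basename(path, ".htm"), "n/a"]] }
  write_combined_csv(["aachen.htm", "bonn.htm"], out, ["city", "tel"], fake_rows)
  puts File.read(out)
end
```

With this shape, rerunning the script replaces output.csv instead of appending a second copy of every row.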
If you have a lot of directories and can't move all of your .htm files into one place, the same logic would apply, but you would first have to loop through their parent directory, then loop through each of the .htm files in each subdirectory:
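Dir.foreach("parent_directory") do |folder|
  next if folder == "." || folder == ".." # skip the special entries
  region = File.join("parent_directory", folder) # Dir.foreach yields bare names
  next unless File.directory?(region)
  Dir.foreach(region) do |file|
    next unless file.end_with?(".htm")
    # insert script here, opening File.join(region, file)
  end
end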
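The two nested loops can also be collapsed into a single Dir.glob call with a wildcard for the region directory. This is a sketch under the assumption that every .htm file sits exactly one level below the root, as in the question's layout. The helper name city_files is hypothetical.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: returns [region, city, path] for every .htm file
# exactly one directory level below root.
def city_files(root)
  Dir.glob(File.join(root, "*", "*.htm")).sort.map do |path|
    region = File.basename(File.dirname(path))
    city   = File.basename(path, ".htm")
    [region, city, path]
  end
end

# Demonstration against a throwaway directory tree.
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, "region_name1"))
  FileUtils.touch(File.join(root, "region_name1", "city_name1.htm"))
  city_files(root).each { |region, city, _path| puts "#{region}/#{city}" }
end
```

Dir.glob returns full paths and never yields "." or "..", so no filtering or File.join is needed inside the loop.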
The Dir class and the FileUtils module are very useful for working with files and folders.
Upvotes: 0