Code_Journey_4_Fun

Reputation: 51

How to read multiple XML files then output to multiple CSV files with the same XML filenames

I am trying to parse multiple XML files and write each one out as a CSV file with the proper rows and columns.

I was able to do this one file at a time by hard-coding the input filename and writing to a fixed output filename:

File.open('H:/output/xmloutput.csv','w')

I would like to write one CSV file per XML file, each named after its XML file, without hard-coding the names. I have tried several approaches without success.

Sample XML:

<?xml version="1.0" encoding="UTF-8"?>
<record:root>
<record:Dataload_Request>
    <record:name>Bob Chuck</record:name>
    <record:Address_Data>
        <record:Street_Address>123 Main St</record:Street_Address>
        <record:Postal_Code>12345</record:Postal_Code>
    </record:Address_Data>
    <record:Age>45</record:Age>
</record:Dataload_Request>
</record:root>

Here is what I've tried:

require 'nokogiri'
require 'set'

files = ''
input_folder = "H:/input"
output_folder = "H:/output"

if input_folder[input_folder.length-1,1] == '/'
   input_folder = input_folder[0,input_folder.length-1]
end

if output_folder[output_folder.length-1,1] != '/'
   output_folder = output_folder + '/'
end


files   = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file    = File.read(input_folder + '/' + files)
doc     = Nokogiri::XML(file)
record  = {} # hashes
keys    = Set.new
records = [] # array
csv     = ""

doc.traverse do |node| 
  value = node.text.gsub(/\n +/, '')
    if node.name != "text" # skip these nodes: if class isnt text then skip
      if value.length > 0 # skip empty nodes
        key = node.name.gsub(/wd:/,'').to_sym
        if key == :Dataload_Request && !record.empty?
          records << record
          record = {}
        elsif key[/^root$|^document$/]
          # neglect these keys
        else
          key = node.name.gsub(/wd:/,'').to_sym
          # in case our value is html instead of text
          record[key] = Nokogiri::HTML.parse(value).text
          # add to our key set only if not already in the set
          keys << key
        end
      end
    end
  end

# build our csv
File.open('H:/output/.*csv', 'w') do |file|
  file.puts %Q{"#{keys.to_a.join('","')}"}
  records.each do |record|
    keys.each do |key|
      file.write %Q{"#{record[key]}",}
    end
    file.write "\n"
  end
  print ''
  print 'output files ready!'
  print ''
end

I have been getting 'read memory': no implicit conversion of Array into String (TypeError) and other errors.

Upvotes: 1

Views: 623

Answers (2)

the Tin Man

Reputation: 160551

Here's a quick peer-review of your code, something like you'd get in a corporate environment...

Instead of writing:

input_folder = "H:/input"

input_folder[input_folder.length-1,1] == '/' # => false

Consider doing it using the -1 offset from the end of the string to access the character:

input_folder[-1] # => "t"

That simplifies your logic and makes it more readable by removing unnecessary visual noise:

input_folder[-1] == '/' # => false

See [] and []= in the String documentation.
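
Going one step further, the trailing-slash bookkeeping can probably be avoided altogether. This is only a sketch: xml_glob is a hypothetical helper name, and the folder values are the ones from the question. String#chomp drops a trailing slash if there is one, and File.join puts a single separator between the parts:

```ruby
# Hypothetical helper: builds the glob pattern for a folder, whether or
# not the folder name was given with a trailing slash.
def xml_glob(folder)
  # chomp('/') removes one trailing slash and is a no-op otherwise;
  # File.join then inserts exactly one separator.
  File.join(folder.chomp('/'), '*.xml')
end

xml_glob('H:/input')   # => "H:/input/*.xml"
xml_glob('H:/input/')  # => "H:/input/*.xml"
```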


This looks like a bug to me:

files   = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file    = File.read(input_folder + '/' + files)

files is an array of filenames. input_folder + '/' + files is appending an array to a string:

foo = ['1', '2'] # => ["1", "2"]
'/parent/' + foo # => 
# ~> -:9:in `+': no implicit conversion of Array into String (TypeError)
# ~>  from -:9:in `<main>'

How you want to deal with that is left as an exercise for the programmer.
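
As a hint, one possible direction is to walk the array and handle one filename at a time, deriving each output name from the input name. The helper names csv_path_for and each_conversion are made up for this sketch, and the parsing and writing are deliberately left out:

```ruby
# Hypothetical helper: maps an XML path to a CSV path in output_folder,
# keeping the base filename and swapping the extension.
def csv_path_for(xml_path, output_folder)
  File.join(output_folder, "#{File.basename(xml_path, '.xml')}.csv")
end

# Hypothetical helper: yields each XML path (oldest first, as in the
# question) together with the CSV path it should be written to.
def each_conversion(input_folder, output_folder)
  xml_paths = Dir[File.join(input_folder, '*.xml')].sort_by { |f| File.mtime(f) }
  xml_paths.each do |xml_path|
    yield xml_path, csv_path_for(xml_path, output_folder)
  end
end

csv_path_for('H:/input/orders.xml', 'H:/output') # => "H:/output/orders.csv"
```

Inside the block passed to each_conversion, File.read(xml_path) receives a single String, so the TypeError goes away.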


doc.traverse do |node|

is icky because it sidesteps the power of Nokogiri being able to search for a particular tag using accessors. Very rarely do we need to iterate over a document tag by tag, usually only when we're peeking at its structure and layout. traverse is slower so use it as a very last resort.


length is nice but isn't needed when checking whether a string has content:

value = 'foo'
value.length > 0 # => true
value > '' # => true

value = ''
value.length > 0 # => false
value > '' # => false

Programmers coming from Java like to use the accessors but I like being lazy, probably because of my C and Perl backgrounds.
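
For completeness, String#empty? states the same check by name, which some readers find clearer than either comparison. The wrapper below is a hypothetical helper, shown only to make the example self-contained:

```ruby
# Hypothetical helper: true when the string has at least one character.
def content?(value)
  !value.empty?
end

content?('foo') # => true
content?('')    # => false
```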


Be careful with sub and gsub as they don't do what you're thinking they do. Both expect a regular expression, but will also take a string, which they escape before beginning their scan.

You're passing in a regular expression, which is OK in this case, but it could cause unexpected problems if you don't remember all the rules for pattern matching and that gsub scans until the end of the string:

foo = 'wd:barwd:' # => "wd:barwd:"
key = foo.gsub(/wd:/,'') # => "bar"

In general I recommend people think a couple times before using regular expressions. I've seen some gaping holes opened up in logic written by fairly advanced programmers because they didn't know what the engine was going to do. They're wonderfully powerful, but need to be used surgically, not as a universal solution.

The same thing happens with a string, because gsub doesn't know when to quit:

key = foo.gsub('wd:','') # => "bar"

So, if you're looking to change just the first instance use sub:

key = foo.sub('wd:','') # => "barwd:"

I'd do it a little differently though.

foo = 'wd:bar'

I can check to see what the first three characters are:

foo[0,3] # => "wd:"

Or I can replace them with something else using string indexing:

foo[0,3] = '' 
foo # => "bar"
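
If you're on Ruby 2.5 or later, String#delete_prefix is another option. It only touches the start of the string, leaves the original untouched, and does nothing when the prefix is absent. strip_wd is a hypothetical helper name for this sketch:

```ruby
# Hypothetical helper: removes a leading "wd:" and nothing else.
def strip_wd(name)
  name.delete_prefix('wd:')
end

strip_wd('wd:bar')    # => "bar"
strip_wd('wd:barwd:') # => "barwd:"
strip_wd('bar')       # => "bar"
```

Unlike foo[0,3] = '', this doesn't chop three characters off a string that never had the prefix.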

There's more but I think that's enough for now.

Upvotes: 2

lacostenycoder

Reputation: 11196

You should use Ruby's CSV class. Also, you don't need to do any string matching or regex stuff. Use Nokogiri to target elements. If you know the node names in the XML will be consistent it should be pretty simple. I'm not exactly sure if this is the output you want, but this should get you in the right direction:

require 'nokogiri'
require 'csv'

def xml_to_csv(filename)
  xml_str = File.read(filename)
  xml_str.gsub!('record:', '') # strip the record: prefix so plain XPath names match
  doc = Nokogiri::XML(xml_str)
  # same name as the XML file, with the extension swapped for .csv
  csv_filename = filename.sub(/\.xml\z/, '.csv')

  CSV.open(csv_filename, 'wb') do |csv|
    csv << ['name', 'street_address', 'postal_code', 'age']
    # assumes one Dataload_Request per file; with several, .text would
    # concatenate the values of every matching element
    csv << [
      doc.xpath('//name').text,
      doc.xpath('//Street_Address').text,
      doc.xpath('//Postal_Code').text,
      doc.xpath('//Age').text,
    ]
  end
end

# iterate over all xml files
Dir.glob('*.xml').each { |filename| xml_to_csv(filename) }
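
One reason to prefer the CSV class over building rows by hand is quoting. The sketch below uses made-up data and a hypothetical helper, records_to_csv, to show a value containing a comma being quoted automatically:

```ruby
require 'csv'

# Hypothetical helper: turns an array of hashes into CSV text,
# with a header row followed by one row per hash.
def records_to_csv(records, keys)
  CSV.generate do |csv|
    csv << keys.map(&:to_s)
    records.each { |record| csv << record.values_at(*keys) }
  end
end

# Made-up record; the comma in the name forces quoting.
records_to_csv([{ name: 'Chuck, Bob', age: '45' }], %i[name age])
# => "name,age\n\"Chuck, Bob\",45\n"
```

Only the field that needs quotes gets them, and there is no trailing comma at the end of each row, which the hand-written loop in the question leaves behind.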

Upvotes: 1
