Code_Journey_4_Fun
Code_Journey_4_Fun

Reputation: 51

XML to CSV ruby

I have multiple XML sample files and I would like to convert them in CSV, but there will be multiple of different attributes/nodes for different XML files therefore I dont want to hard code the different attributes. I would like the output to show column header as the first row, then each of the node/record being display vertically like a traditional column and row spreadsheet. Here is one the sample XML:

<?xml version="1.0" encoding="UTF-8"?>
<sd:root xmlns:wd="urn:com.sample/bsvc" sd:version="v31.0">
  <sd:Put_Job_Profile_Request sd:Add_Only="0">
    <sd:Job_Profile_Data>
      <sd:Job_Code>30000</sd:Job_Code>
      <sd:Effective_Date>1900-01-01</sd:Effective_Date>
      <sd:Job_Profile_Basic_Data>
        <sd:Job_Title>Chief Executive Officer</sd:Job_Title>
      </sd:Job_Profile_Basic_Data>
    </sd:Job_Profile_Data>
  </sd:Put_Job_Profile_Request>
  <sd:Put_Job_Profile_Request sd:Add_Only="0">
    <sd:Job_Profile_Data>
      <sd:Job_Code>30100</sd:Job_Code>
      <sd:Effective_Date>1900-01-01</sd:Effective_Date>
      <sd:Job_Profile_Basic_Data>
        <sd:Job_Title>Administrator Job Profile</sd:Job_Title>
      </sd:Job_Profile_Basic_Data>
    </sd:Job_Profile_Data>
  </sd:Put_Job_Profile_Request>
  <sd:Put_Job_Profile_Request sd:Add_Only="0">
    <sd:Job_Profile_Data>
      <sd:Job_Code>30200</sd:Job_Code>
      <sd:Effective_Date>1900-01-01</sd:Effective_Date>
      <sd:Job_Profile_Basic_Data>
        <sd:Inactive>0</sd:Inactive>
        <sd:Job_Title>Facilities &amp; Grounds Maintenance Attendant</sd:Job_Title>
        <sd:Include_Job_Code_in_Name>0</sd:Include_Job_Code_in_Name>
        <sd:Job_Profile_Private_Title>Maintenance Job Title</sd:Job_Profile_Private_Title>
        <sd:Job_Profile_Summary>Maintain cleanliness of the campus building throughout the day and fulfill special requests as needed.</sd:Job_Profile_Summary>
        <sd:Job_Description>&lt;p>Job Description&lt;b> rich text!&lt;/b>&lt;/p></sd:Job_Description>
        <sd:Additional_Job_Description>&lt;p>&lt;b>&lt;i>&lt;span class="emphasis-2">&lt;u>Additional&lt;/u>&lt;/span>&lt;/i>&lt;/b> Job Description&lt;b> rich text!&lt;/b>&lt;/p></sd:Additional_Job_Description>
        <sd:Work_Shift_Required>0</sd:Work_Shift_Required>
        <sd:Public_Job>1</sd:Public_Job>    
      </sd:Job_Profile_Basic_Data>
    </sd:Job_Profile_Data>
  </sd:Put_Job_Profile_Request>
  <sd:Put_Job_Profile_Request sd:Add_Only="0">
    <sd:Job_Profile_Data>
      <sd:Job_Code>30300</sd:Job_Code>
      <sd:Effective_Date>1900-01-01</sd:Effective_Date>
      <sd:Job_Profile_Basic_Data>
        <sd:Inactive>0</sd:Inactive>
        <sd:Job_Title>Sample_Job_Title</sd:Job_Title>
        <sd:Include_Job_Code_in_Name>0</sd:Include_Job_Code_in_Name>
        <sd:Job_Profile_Summary>Sample Job Profile Summary</sd:Job_Profile_Summary>
        <sd:Job_Description>Sample Job Description</sd:Job_Description>
        <sd:Additional_Job_Description>Sample Additional Job Description</sd:Additional_Job_Description>
        <sd:Work_Shift_Required>1</sd:Work_Shift_Required>
      </sd:Job_Profile_Basic_Data>
    </sd:Job_Profile_Data>
  </sd:Put_Job_Profile_Request>
</sd:root>

The code I used but come out incorrectly:

require 'csv'
require 'nokogiri'

file = File.read('jobProfile.xml')
doc = Nokogiri::XML(file)
a = []

CSV.open('xmloutput.csv', 'wb') do |csv|
  csv << doc.at('.').search('*').map(&:name)
  doc.search('.').each do |x|
    csv << x.search('*').map(&:text)
  end
end

Each set of record with its column headers and data being set horizontally. But I would like to iterate the data and keep one row of column headers. I am not sure how to do that without hard coding each of the attributes :/ Please help as I am still new to programming, and I have tried to find solution for a week :(

screenshot showing the csv output

Upvotes: 1

Views: 280

Answers (1)

peter
peter

Reputation: 42182

You need to first build an array of hashes and extract the keys as headers and then put the values in the right column, all nodes flattened to columns, ignoring the root and record keys.

Something like this

require 'nokogiri'
require 'set'

file    = File.read('jobProfile.xml')
doc     = Nokogiri::XML(file)
record  = {}
keys    = Set.new
records = []
csv     = ""

doc.traverse do |node| 
  value = node.text.gsub(/\n +/, '')
  if node.name
    if node.name != "text" # skip these nodes
      if value.length > 0 # skip empty nodes
        key = node.name.gsub(/sd:/,'').to_sym
        # if a new and not empty record, add to our records collection
        if key == :Job_Profile_Data && !record.empty?
          records << record
          record = {}
        elsif key[/Job_Profile|^root$|^document$/]
          # neglect these keys
        else
          key = node.name.gsub(/sd:/,'').to_sym
          # in case our value is html instead of text
          record[key] = Nokogiri::HTML.parse(value).text
          # add to our key set only if not allready in the set
          keys << key
        end
      end
    end
  end
end

# build our csv
File.open('./xmloutput.csv', 'w') do |file|
  file.puts %Q{"#{keys.to_a.join('","')}"}
  records.each do |record|
    keys.each do |key|
      file.write %Q{"#{record[key]}",}
    end
    file.write "\n"
  end
end

Which gives in our csv file the following

"Job_Code","Effective_Date","Job_Title","Inactive","Include_Job_Code_in_Name","Job_Description","Additional_Job_Description","Work_Shift_Required","Public_Job"
"30000","1900-01-01","Chief Executive Officer","","","","","","",
"30100","1900-01-01","Administrator Job Profile","","","","","","",
"30200","1900-01-01","Facilities & Grounds Maintenance Attendant","0","0","Job Description rich text!","Additional Job Description rich text!","0","1",
"30300","1900-01-01","Sample_Job_Title","0","0","Sample Job Description","Sample Additional Job Description","1","",

Upvotes: 1

Related Questions