appleII717

Reputation: 368

Rails Hash.from_xml not giving expected results

I'm trying to process some XML that comes from an application called TeleForm. This is form-scanning software: it captures the data from scanned forms and writes it out as XML. Here is a snippet of the XML:

<?xml version="1.0" encoding="ISO-8859-1"?>
<Records>
  <Record>
    <Field id="ImageFilename" type="string" length="14"><Value>00000022000000</Value></Field>
    <Field id="Criterion_1" type="number" length="2"><Value>3</Value></Field>
    <Field id="Withdrew" type="string" length="1"></Field>
  </Record>

  <Record>
    <Field id="ImageFilename" type="string" length="14"><Value>00000022000001</Value></Field>
    <Field id="Criterion_1" type="number" length="2"><Value>3</Value></Field>
    <Field id="Withdrew" type="string" length="1"></Field>
  </Record>
</Records>

I've dealt with this in another system, probably using a custom parser we wrote. I figured it would be no problem in Rails, but I was wrong.

Parsing this with Hash.from_xml or with Nokogiri does not give me the results I expected. I get:

{"Records"=>{"Record"=>[{"Field"=>["", {"id"=>"Criterion_1", "type"=>"number", "length"=>"2", "Value"=>"3"}, ""]},
 {"Field"=>["", {"id"=>"Criterion_1", "type"=>"number", "length"=>"2", "Value"=>"3"}, ""]}]}}

After spending way too much time on this, I discovered that if I gsub out the type and length attributes, I get what I expected (even if the overall result is still wrong, because I only removed them from the first Record node):

{"Records"=>{"Record"=>[{"Field"=>[{"id"=>"ImageFilename", "Value"=>"00000022000000"}, 
{"id"=>"Criterion_1", "type"=>"number", "length"=>"2", "Value"=>"3"}, {"id"=>"Withdrew"}]}, 
{"Field"=>["", {"id"=>"Criterion_1", "type"=>"number", "length"=>"2", "Value"=>"3"}, ""]}]}}

Not being well versed in XML, I assume the parser treats the type and length attributes as instructions to convert the values to those data types. In that case, I can understand why the "Withdrew" field showed up as empty, but I don't understand why "ImageFilename" was empty: it is a 14-character string.

I've got the workaround with gsub, but is this invalid XML? Would adding a DTD (which TeleForm should have provided) give me different results?
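On the "is this invalid XML?" part: the snippet is well-formed, and type and length are ordinary attributes that only carry meaning if the consumer chooses to interpret them. A minimal sketch, assuming REXML (distributed with Ruby) instead of Hash.from_xml or Nokogiri, reads each Field by hand so nothing is cast implicitly. The helper name parse_records is hypothetical, and the XML declaration is left out of the sample for brevity.

```ruby
require "rexml/document"

# Hypothetical helper (not from the original post): turn each <Record> into a
# plain Hash of field id => value, treating type/length as ordinary attributes.
def parse_records(xml)
  doc = REXML::Document.new(xml)
  doc.get_elements("//Record").map do |record|
    record.get_elements("Field").each_with_object({}) do |field, row|
      value_node = field.elements["Value"]
      text = value_node && value_node.text # nil when <Value> is absent
      # Only "number" fields are converted here; everything else stays a String.
      row[field.attributes["id"]] =
        if field.attributes["type"] == "number" && text
          text.to_i
        else
          text
        end
    end
  end
end

# Sample taken from the snippet above (XML declaration omitted).
SAMPLE_XML = <<~XML
  <Records>
    <Record>
      <Field id="ImageFilename" type="string" length="14"><Value>00000022000000</Value></Field>
      <Field id="Criterion_1" type="number" length="2"><Value>3</Value></Field>
      <Field id="Withdrew" type="string" length="1"></Field>
    </Record>
  </Records>
XML

records = parse_records(SAMPLE_XML)
```

With this approach the empty Withdrew field comes back as nil rather than an empty string, so the caller decides how to treat missing values.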

EDIT

I'll provide a possible answer to my own question with some code as an edit. The code follows some of the features in the one answer I did receive from Mark Thomas, but I decided against Nokogiri for the following reasons:

An expanded version of the XML with one complete record:

<?xml version="1.0" encoding="ISO-8859-1"?>
<Records>
  <Record>
    <Field id="ImageFilename" type="string" length="14"><Value>00000022000000</Value></Field>
    <Field id="DocID" type="string" length="15"><Value>731192AIINSC</Value></Field>
    <Field id="FormID" type="string" length="6"><Value>AIINSC</Value></Field>
    <Field id="Availability" type="string" length="18"><Value>M  T  W  H  F  S</Value></Field>
    <Field id="Criterion_1" type="number" length="2"><Value>3</Value></Field>
    <Field id="Criterion_2" type="number" length="2"><Value>3</Value></Field>
    <Field id="Criterion_3" type="number" length="2"><Value>3</Value></Field>
    <Field id="Criterion_4" type="number" length="2"><Value>3</Value></Field>
    <Field id="Criterion_5" type="number" length="2"><Value>3</Value></Field>
    <Field id="Criterion_6" type="number" length="2"><Value>3</Value></Field>
    <Field id="Criterion_7" type="number" length="2"><Value>3</Value></Field>
    <Field id="Criterion_8" type="number" length="2"><Value>3</Value></Field>
    <Field id="Criterion_9" type="number" length="2"><Value>3</Value></Field>
    <Field id="Criterion_10" type="number" length="2"><Value>3</Value></Field>
    <Field id="Criterion_11" type="number" length="2"><Value>0</Value></Field>
    <Field id="Criterion_12" type="number" length="2"><Value>0</Value></Field>
    <Field id="Criterion_13" type="number" length="2"><Value>0</Value></Field>
    <Field id="Criterion_14" type="number" length="2"><Value>0</Value></Field>
    <Field id="Criterion_15" type="number" length="2"><Value>0</Value></Field>
    <Field id="DayTraining" type="string" length="1"><Value>Y</Value></Field>
    <Field id="SaturdayTraining" type="string" length="1"></Field>
    <Field id="CitizenStageID" type="string" length="12"><Value>731192</Value></Field>
    <Field id="NoShow" type="string" length="1"></Field>
    <Field id="NightTraining" type="string" length="1"></Field>
    <Field id="Withdrew" type="string" length="1"></Field>
    <Field id="JobStageID" type="string" length="12"><Value>2292</Value></Field>
    <Field id="DirectHire" type="string" length="1"></Field>
  </Record>
</Records>

I am only experimenting with a workflow prototype to replace an aging system written in 4D and Active4D. This area of processing TeleForm data was implemented as a batch operation, and it may still revert to that. I am just trying to merge some of the old viable concepts into a new Rails implementation. The XML files are on a shared server and will probably have to be moved into the web root, with some trigger then set to process the files.

I am still in the defining stage, but my module and classes for handling the InterviewForm look like this and may change. There is little error trapping; I'm still trying to get into testing, and my Ruby is not as good as it should be after playing with Rails for about 5 years!

module Teleform::InterviewForm

  class Form < Prawn::Document
    # Not relevant to this question, but this class generates the forms from a fillable PDF template and
    # relevant model data.
    # These forms, when completed, are what TeleForm processes to produce the XML.
  end

  class RateForms
    attr_accessor  :records, :results

    def initialize(xml_path)
      fields = []
      xml = File.read(xml_path)
      # Hash.from_xml does not like a type of "string"
      hash = Hash.from_xml(xml.gsub(/type="string"/,'type="text"'))
      hash["Records"]["Record"].each do |record|
        # extract the fields from each record
        fields << record["Field"]
      end
      @records = []
      fields.each do |field|
        #build the records for the form
        @records << Record.new(field)
      end
      @results = rate_records
    end

    def rate_records
      # not relevant to the question, but this is where the data is processed and a bunch of stuff takes place
      return "Any errors"
    end
  end


  class Record
    attr_accessor(*[:image_filename, :doc_id, :form_id, :availability, :criterion_1, :criterion_2, 
      :criterion_3, :criterion_4, :criterion_5, :criterion_6, :criterion_7, :criterion_8, 
      :criterion_9, :criterion_10, :criterion_11, :criterion_12, :criterion_13, :criterion_14, :criterion_15, 
      :day_training, :saturday_training, :citizen_stage_id, :no_show, :night_training, :withdrew, :job_stage_id, :direct_hire])

    def initialize(fields)
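      fields.each do |field|
        # Build the setter name from the field id, e.g. "ImageFilename" -> "image_filename="
        setter = "#{field["id"].underscore}="
        # Only "number" fields are converted; everything else keeps the parsed value
        value = field["type"] == "number" ? field["Value"].to_i : field["Value"]
        # try silently skips ids that have no matching accessor
        try(setter, value)
      end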
    end
  end

end
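Since testing is still on the to-do list, here is a rough sketch of how the id-to-accessor mapping in Record could be exercised outside Rails. It is an illustration under assumptions, not the code above: the underscore helper is a simplified stand-in for ActiveSupport's String#underscore, SketchRecord is a hypothetical cut-down class with only four accessors, and respond_to? plus public_send replace try.

```ruby
# Simplified stand-in for ActiveSupport's String#underscore (an assumption made
# so the sketch runs without Rails); it covers the id styles seen in this XML.
def underscore(name)
  name.gsub(/([A-Z\d]+)([A-Z][a-z])/, '\1_\2')
      .gsub(/([a-z\d])([A-Z])/, '\1_\2')
      .downcase
end

# Hypothetical cut-down version of Record with only a few accessors.
class SketchRecord
  attr_accessor :image_filename, :doc_id, :criterion_1, :withdrew

  def initialize(fields)
    fields.each do |field|
      setter = "#{underscore(field["id"])}="
      next unless respond_to?(setter) # skip ids without a matching accessor

      value = field["Value"]
      value = value.to_i if field["type"] == "number"
      public_send(setter, value)
    end
  end
end

# Field hashes shaped like the Hash.from_xml output after the gsub workaround.
fields = [
  { "id" => "ImageFilename", "type" => "text",   "length" => "14", "Value" => "00000022000000" },
  { "id" => "DocID",         "type" => "text",   "length" => "15", "Value" => "731192AIINSC" },
  { "id" => "Criterion_1",   "type" => "number", "length" => "2",  "Value" => "3" },
  { "id" => "Withdrew",      "type" => "text",   "length" => "1" },
  { "id" => "NotAnAccessor", "type" => "text",   "length" => "1" }
]

record = SketchRecord.new(fields)
```

Feeding the initializer hand-built hashes like this keeps the mapping logic testable without touching the XML parser at all.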

Upvotes: 2

Views: 1966

Answers (2)

nickl-

Reputation: 8731

It appears XmlSimple (by maik) is better suited for this task than the unreliable and inconsistent Hash.from_xml implementation.

It is a port of the tried and tested Perl module of the same name, and it has several notable advantages:

  • It is consistent, whether it finds one or many occurrences of a node.
  • It does not choke and garble the results.
  • It is able to distinguish between attributes and node content.

Running the same XML document from the question through the parser:

require 'xmlsimple'
XmlSimple.xml_in xml

produces the following result:

{"Record"=>
  [{"Field"=>
     [{"id"=>"ImageFilename", "type"=>"string", "length"=>"14", "Value"=>["00000022000000"]},
      {"id"=>"DocID", "type"=>"string", "length"=>"15", "Value"=>["731192AIINSC"]},
      {"id"=>"FormID", "type"=>"string", "length"=>"6", "Value"=>["AIINSC"]},
      {"id"=>"Availability", "type"=>"string", "length"=>"18", "Value"=>["M  T  W  H  F  S"]},
      {"id"=>"Criterion_1", "type"=>"number", "length"=>"2", "Value"=>["3"]},
      {"id"=>"Criterion_2", "type"=>"number", "length"=>"2", "Value"=>["3"]},
      {"id"=>"Criterion_3", "type"=>"number", "length"=>"2", "Value"=>["3"]},
      {"id"=>"Criterion_4", "type"=>"number", "length"=>"2", "Value"=>["3"]},
      {"id"=>"Criterion_5", "type"=>"number", "length"=>"2", "Value"=>["3"]},
      {"id"=>"Criterion_6", "type"=>"number", "length"=>"2", "Value"=>["3"]},
      {"id"=>"Criterion_7", "type"=>"number", "length"=>"2", "Value"=>["3"]},
      {"id"=>"Criterion_8", "type"=>"number", "length"=>"2", "Value"=>["3"]},
      {"id"=>"Criterion_9", "type"=>"number", "length"=>"2", "Value"=>["3"]},
      {"id"=>"Criterion_10", "type"=>"number", "length"=>"2", "Value"=>["3"]},
      {"id"=>"Criterion_11", "type"=>"number", "length"=>"2", "Value"=>["0"]},
      {"id"=>"Criterion_12", "type"=>"number", "length"=>"2", "Value"=>["0"]},
      {"id"=>"Criterion_13", "type"=>"number", "length"=>"2", "Value"=>["0"]},
      {"id"=>"Criterion_14", "type"=>"number", "length"=>"2", "Value"=>["0"]},
      {"id"=>"Criterion_15", "type"=>"number", "length"=>"2", "Value"=>["0"]},
      {"id"=>"DayTraining", "type"=>"string", "length"=>"1", "Value"=>["Y"]},
      {"id"=>"SaturdayTraining", "type"=>"string", "length"=>"1"},
      {"id"=>"CitizenStageID", "type"=>"string", "length"=>"12", "Value"=>["731192"]},
      {"id"=>"NoShow", "type"=>"string", "length"=>"1"},
      {"id"=>"NightTraining", "type"=>"string", "length"=>"1"},
      {"id"=>"Withdrew", "type"=>"string", "length"=>"1"},
      {"id"=>"JobStageID", "type"=>"string", "length"=>"12", "Value"=>["2292"]},
      {"id"=>"DirectHire", "type"=>"string", "length"=>"1"}]
  }]
}
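The structure above still nests each value inside a one-element array, so a small post-processing step may be handy. The following is only a sketch: flatten_record is a hypothetical helper, the input is a hand-copied subset of the output shown above (so it runs without the xml-simple gem), and converting "number" fields with to_i is an assumption about what the caller wants.

```ruby
# Hypothetical helper: collapse one XmlSimple-style record into id => value.
def flatten_record(record)
  record["Field"].each_with_object({}) do |field, row|
    raw = field["Value"]&.first # nil when the field had no <Value> child
    row[field["id"]] = (field["type"] == "number" && raw) ? raw.to_i : raw
  end
end

# Subset of the XmlSimple result shown above, copied by hand.
parsed = {
  "Record" => [
    { "Field" => [
      { "id" => "ImageFilename", "type" => "string", "length" => "14", "Value" => ["00000022000000"] },
      { "id" => "Criterion_1",   "type" => "number", "length" => "2",  "Value" => ["3"] },
      { "id" => "Withdrew",      "type" => "string", "length" => "1" },
      { "id" => "JobStageID",    "type" => "string", "length" => "12", "Value" => ["2292"] }
    ] }
  ]
}

rows = parsed["Record"].map { |record| flatten_record(record) }
```

Each element of rows is then a flat hash keyed by field id, which is easier to hand to a model or a domain object than the nested structure.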

I am contemplating fixing the problem and providing Hash with a working implementation for from_xml and was hoping to find some feedback from others who reached the same conclusion. Surely we are not the only ones with these frustrations.

In the meantime we may find solace in knowing there is something lighter than Nokogiri and its full kitchen sink for this task.

nJoy!

Upvotes: 1

Mark Thomas

Reputation: 37517

Thanks for adding the additional information that this is a rating for an interviewee. Using this domain information in your code will likely improve it. You haven't posted any code, but generally using domain objects leads to more concise and more readable code. I recommend creating a simple class representing a Rating, rather than transforming data from XML to a data structure.

class Rating
  attr_accessor :image_filename, :criterion_1, :withdrew
end

Using the above class, here's one way to extract the fields from the XML using Nokogiri.

require 'nokogiri'

doc = Nokogiri::XML(xml)
ratings = []

doc.xpath('//Record').each do |record|
  rating = Rating.new
  # to_s turns a missing <Value> node (nil) into an empty string
  rating.image_filename = record.at('Field[@id="ImageFilename"]/Value/text()').to_s
  rating.criterion_1 = record.at('Field[@id="Criterion_1"]/Value/text()').to_s
  rating.withdrew = record.at('Field[@id="Withdrew"]/Value/text()').to_s
  ratings << rating
end

Now, ratings is a list of Rating objects, each with methods to retrieve the data. This is a lot cleaner than delving into a deep data structure. You could even improve on the Rating class further, for example by creating a withdrew? method that returns true or false.
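A withdrew? predicate could look roughly like the sketch below. The sample data never shows a filled-in Withdrew field, so the rule used here (any non-blank value counts as withdrawn) is purely an assumption and should be adjusted to whatever TeleForm actually emits.

```ruby
class Rating
  attr_accessor :image_filename, :criterion_1, :withdrew

  # Assumption: any non-blank value in the Withdrew field means the
  # interviewee withdrew; nil, "" and whitespace-only strings count as false.
  def withdrew?
    !withdrew.to_s.strip.empty?
  end
end

blank_rating = Rating.new
blank_rating.withdrew = ""

marked_rating = Rating.new
marked_rating.withdrew = "Y"
```

Callers can then write rating.withdrew? instead of comparing raw strings throughout the code.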

Upvotes: 0
