frenchloaf

Reputation: 1054

Grabbing Only the Entries With Specific Attribute Value On A Given Node

I am pulling entries from an XML file using an Xml::Parser wrapper on top of Nokogiri::XML::Reader. I would like to grab only the <Property> entries where Property/PropertyID/Identification has OrganizationName="northsteppe", but I cannot figure out the correct syntax to do so. The entire rake task I've been building is below, followed by a sample node with all of its information and tags. Any guidance would be much appreciated.

================ UPDATE ===============

The file I am parsing comes from an external source and will be pulled in using open-uri. During development I am working with a local copy of an older version to speed things up, as the file is 300MB+ in size. I tried using a SAX parser, but that logic was a bit too complex for me to really grasp what was going on, and I ran into the same issue: limiting the properties I grab to only those with 'northsteppe' as the OrganizationName in the Identification tag. So I opted to try the same task with the current approach. I am able to grab almost all of the information I need; I am just missing that last piece.

=============== GETTING AS SPECIFIC AS POSSIBLE =============

So, I feel as though describing the exact task I am trying to perform will help fill in any gaps. The task is as follows.

Grab every property from the XML file that has OrganizationName = 'northsteppe' in its <Identification> tag, then gather all of the corresponding information for that property and insert it into a hash. Once all of the information for an individual property has been gathered in that hash, it needs to be uploaded as an individual entry to the database, which is already built out the way it needs to be. Once that property is inserted into the DB, the rake task moves on to the next Property that has OrganizationName = 'northsteppe' in its <Identification> tag and repeats the process, until all matching properties have been inserted into the DB. The purpose is to let me run quick searches on just the Northsteppe properties without bogging down the system with every single property in the XML file.

Eventually, I will be using open-uri to pull the file from its external source and run a cron job to execute this rake task once every 6 hours and replace the DB.

================= CODE =================

namespace :db do

# RAKE TASK DESCRIPTION
desc "Fetch property information and insert it into the database"

# RAKE TASK NAME    
task :print_properties => :environment do

    require 'rubygems'
    require 'nokogiri'

    module Xml
      class Parser
        def initialize(node, &block)
          @node = node
          @node.each do
            self.instance_eval &block
          end
        end

        def name
          @node.name
        end

        def inner_xml
          @node.inner_xml.strip
        end

        def is_start?
          @node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
        end

        def is_end?
          @node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT
        end

        def attribute(attribute)
          @node.attribute(attribute)
        end

        def for_element(name, &block)
          return unless self.name == name and is_start?
          self.instance_eval &block
        end

        def inside_element(name=nil, &block)
          return if @node.self_closing?
          return unless name.nil? or (self.name == name and is_start?)

          name = @node.name
          depth = @node.depth

          @node.each do
            return if self.name == name and is_end? and @node.depth == depth
            self.instance_eval &block
          end
        end
      end
    end


    Xml::Parser.new(Nokogiri::XML::Reader(open("app/assets/xml/mits.xml"))) do
        inside_element 'Property' do

            # OPEN AND PARSE THE <PropertyID> TAG
            inside_element 'PropertyID' do

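                # NOTE: <Identification .../> is self-closing, so inside_element
                # returns before this block runs, and attribute_nodes is not a
                # method defined on Xml::Parser.
                inside_element 'Identification' do
                    puts attribute_nodes()
                end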

                # OPEN AND PARSE THE <Address> TAG
                inside_element 'Address' do
                    for_element 'AddressLine1' do puts "Street Address: #{inner_xml}" end
                    for_element 'City' do puts "City: #{inner_xml}" end
                    for_element 'PostalCode' do puts "Zipcode: #{inner_xml}" end
                end

            for_element 'MarketingName' do puts "Short Description: #{inner_xml}" end
            end

            # OPEN AND PARSE THE <Information> TAG
            inside_element 'Information' do
                for_element 'LongDescription' do puts "Long Description: #{inner_xml}" end
                inside_element 'Rents' do
                    for_element 'StandardRent' do puts "Rent: #{inner_xml}" end
                end
            end

            inside_element 'Fee' do
                for_element 'ApplicationFee' do puts "Application Fee: #{inner_xml}" end
            end

            inside_element 'ILS_Identification' do
                for_element 'Latitude' do puts "Latitude: #{inner_xml}" end
                for_element 'Longitude' do puts "Longitude: #{inner_xml}" end
            end

        end
    end

end # END PRINT_PROPERTIES TASK

end #END NAMESPACE

and a sample of the XML --

<Property IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
<PropertyID>
  <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/>
  <Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/>
  <MarketingName>Spacious House Central Campus OSU, available fall</MarketingName>
  <WebSite>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</WebSite>
  <Address AddressType="property">
    <Description>Address of Available Listing</Description>
    <AddressLine1>1689 N 4th St </AddressLine1>
    <City>Columbus</City>
    <State>OH</State>
    <PostalCode>43201</PostalCode>
    <Country>US</Country>
  </Address>
  <Phone PhoneType="office">
    <PhoneNumber>(614) 299-4110</PhoneNumber>
  </Phone>
  <Email>[email protected]</Email>
</PropertyID>
<ILS_Identification ILS_IdentificationType="Apartment" RentalType="Market Rate">
  <Latitude>39.997694</Latitude>
  <Longitude>-82.99903</Longitude>
  <LastUpdate Month="11" Day="11" Year="2013"/>
</ILS_Identification>
<Information>
  <StructureType>Standard</StructureType>
  <UnitCount>1</UnitCount>
  <ShortDescription>Spacious House Central Campus OSU, available fall</ShortDescription>
  <LongDescription>One of our favorites! This great house is perfect for students or a single family. With huge living and sleeping rooms, there is plenty of space. The kitchen is totally modernized with new appliances, and the bathroom has been updated. Natural woodwork and brick accents are seen within the house, and the decorative mantles. Ceiling fans and mini-blinds are included, as well as a FREE stack washer and dryer. The front and side deck. On site parking available.</LongDescription>
  <Rents>
    <StandardRent>2000.00</StandardRent>
  </Rents>
  <PropertyAvailabilityURL>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</PropertyAvailabilityURL>
</Information>
<Fee>
  <ProrateType>Standard</ProrateType>
  <LateType>Standard</LateType>
  <LatePercent>0</LatePercent>
  <LateMinFee>0</LateMinFee>
  <LateFeePerDay>0</LateFeePerDay>
  <NonRefundableHoldFee>0</NonRefundableHoldFee>
  <AdminFee>0</AdminFee>
  <ApplicationFee>30.00</ApplicationFee>
  <BrokerFee>0</BrokerFee>
</Fee>
<Deposit DepositType="Security Deposit">
  <Amount AmountType="Actual">
    <ValueRange Exact="2000.00" Currency="USD"/>
  </Amount>
</Deposit>
<Policy>
  <Pet Allowed="false"/>
</Policy>
<Phase IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <Name/>
  <Description/>
  <UnitCount>1</UnitCount>
  <RentableUnits>1</RentableUnits>
  <TotalSquareFeet>0</TotalSquareFeet>
  <RentableSquareFeet>0</RentableSquareFeet>
</Phase>
<Building IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <Name/>
  <Description/>
  <UnitCount>1</UnitCount>
  <SquareFeet>0</SquareFeet>
</Building>
<Floorplan IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <Name/>
  <UnitCount>1</UnitCount>
  <Room RoomType="Bedroom">
    <Count>4</Count>
    <Comment/>
  </Room>
  <Room RoomType="Bathroom">
    <Count>1</Count>
    <Comment/>
  </Room>
  <SquareFeet Min="0" Max="0"/>
  <MarketRent Min="2000" Max="2000"/>
  <EffectiveRent Min="2000" Max="2000"/>
</Floorplan>
<ILS_Unit IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <Units>
    <Unit>
      <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="UL Portfolio"/>
      <MarketingName>Spacious House Central Campus OSU, available fall</MarketingName>
      <UnitBedrooms>4</UnitBedrooms>
      <UnitBathrooms>1.0</UnitBathrooms>
      <MinSquareFeet>0</MinSquareFeet>
      <MaxSquareFeet>0</MaxSquareFeet>
      <SquareFootType>internal</SquareFootType>
      <UnitRent>2000.00</UnitRent>
      <MarketRent>2000.00</MarketRent>
      <Address AddressType="property">
        <AddressLine1>1689 N 4th St </AddressLine1>
        <City>Columbus</City>
        <PostalCode>43201</PostalCode>
        <Country>US</Country>
      </Address>
    </Unit>
  </Units>
  <Availability>
    <VacateDate Month="7" Day="23" Year="2014"/>
    <VacancyClass>Unoccupied</VacancyClass>
    <MadeReadyDate Month="7" Day="23" Year="2014"/>
  </Availability>
  <Amenity AmenityType="Other">
    <Description>All new stainless steel appliances!  Refinished hardwood floors</Description>
  </Amenity>
  <Amenity AmenityType="Other">
    <Description>Ceramic tile</Description>
  </Amenity>
  <Amenity AmenityType="Other">
    <Description>Ceiling fans</Description>
  </Amenity>
  <Amenity AmenityType="Other">
    <Description>Wrap-around porch</Description>
  </Amenity>
  <Amenity AmenityType="Dryer">
    <Description>Free Washer and Dryer</Description>
  </Amenity>
  <Amenity AmenityType="Washer">
    <Description>Free Washer and Dryer</Description>
  </Amenity>
  <Amenity AmenityType="Other">
    <Description>off-street parking available</Description>
  </Amenity>
</ILS_Unit>
<File Active="true" FileID="820982141">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/31077069-6e81-4373-8a89-508c57585543/medium.jpg</Src>
  <Width>360</Width>
  <Height>300</Height>
  <Rank>1</Rank>
</File>
<File Active="true" FileID="820982145">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/84e1be40-96fd-4717-b75d-09b39231a762/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>2</Rank>
</File>
<File Active="true" FileID="820982149">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/cd419635-c37f-4676-a43e-c72671a2a748/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>3</Rank>
</File>
<File Active="true" FileID="820982152">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/6b68dbd5-2cde-477c-99d7-3ca33f03cce8/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>4</Rank>
</File>
<File Active="true" FileID="820982155">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/17b6c7c0-686c-4e46-865b-11d80744354a/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>5</Rank>
</File>
<File Active="true" FileID="820982157">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/3545ac8b-471f-404a-94b2-fcd00dd16e25/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>6</Rank>
</File>
<File Active="true" FileID="820982160">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/02471172-2183-4bf1-a3d7-33415f902c1c/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>7</Rank>
</File>
  </Property>

Upvotes: 0

Views: 699

Answers (2)

frenchloaf

Reputation: 1054

The solution I found was a little gem called Saxerator (https://github.com/soulcutter/saxerator). It does SAX-style parsing without having to write Nokogiri SAX handlers by hand (thank you), has excellent documentation, and runs super fast. I would encourage anyone who needs a SAX parser in the future to investigate this little gem (pun intended) and skip the burden of having to process all of that horribly written Nokogiri documentation. The solution to my problem is below and lives in my seeds.rb file.

require 'saxerator'

parser = Saxerator.parser(File.new("app/assets/xml/mits_snip.xml")) do |config|
  config.put_attributes_in_hash!
  config.symbolize_keys!
end


parser.for_tag(:Property).each do |property|
    if property[:PropertyID][:Identification][1][:OrganizationName] == 'northsteppe'
        property_attributes = {
            street_address:     property[:PropertyID][:Address][:AddressLine1],
            city:               property[:PropertyID][:Address][:City],
            zipcode:            property[:PropertyID][:Address][:PostalCode],
            short_description:  property[:PropertyID][:MarketingName],
            long_description:   property[:Information][:LongDescription],
            rent:               property[:Information][:Rents][:StandardRent],
            application_fee:    property[:Fee][:ApplicationFee],
            vacancy_status:     property[:ILS_Unit][:Availability][:VacancyClass],
            month_available:    property[:ILS_Unit][:Availability][:MadeReadyDate][:Month],
            latitude:           property[:ILS_Identification][:Latitude],
            longitude:          property[:ILS_Identification][:Longitude]

        }

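        # create! raises on failure rather than returning false, so use save
        # to make the else branch reachable
        if Property.new(property_attributes).save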
            puts "wahoo"
        else
            puts "nope"
        end
    end
end
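
One caveat with the [:Identification][1] check above: it assumes every Property has at least two Identification entries in a fixed order. The sketch below is a more defensive variant. It assumes the parser yields a Hash for a single child element and an Array for repeated ones, and it uses plain Hashes as stand-ins for Saxerator's element objects; as_list and owned_by? are hypothetical helper names, not part of Saxerator.

```ruby
# Hypothetical helper: wrap a lone Hash in an Array, pass Arrays through,
# and turn nil into an empty list.
def as_list(value)
  case value
  when nil   then []
  when Array then value
  else [value]
  end
end

# True if any <Identification> directly under <PropertyID> names the organization.
def owned_by?(property, organization)
  identifications = as_list(property.dig(:PropertyID, :Identification))
  identifications.any? { |id| id[:OrganizationName] == organization }
end

# Plain Hashes standing in for what the parser yields (assumed shapes).
two_ids = { PropertyID: { Identification: [
  { IDType: 'property', OrganizationName: 'northsteppe' },
  { IDType: 'Company',  OrganizationName: 'northsteppe' }
] } }
one_id = { PropertyID: { Identification: { OrganizationName: 'northsteppe' } } }
other  = { PropertyID: { Identification: [{ OrganizationName: 'UL Portfolio' }] } }

[two_ids, one_id, other].each do |property|
  puts owned_by?(property, 'northsteppe')
end
```

With a helper like this, the condition in the loop would read owned_by?(property, 'northsteppe') and would not depend on how many Identification tags a property carries.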

============== UPDATE =================

So I actually rewrote this task to work a lot better and just wanted to share it down here in case anyone ever stumbles across this question. This is in my seeds.rb file:

require 'saxerator'
require 'open-uri'
@company_name = 'northsteppe'
parser = Saxerator.parser(File.new("../../shared/assets/xml/mits.xml")) do |config|
  config.put_attributes_in_hash!
  config.symbolize_keys!
end
puts "DELETED ALL EXISTING PROPERTIES" if Property.delete_all
puts "PULLING RELEVANT XML ENTRIES"
# A local variable is enough here; blocks can read and update it, and
# class variables at the top level raise an error on Ruby 3.0+.
count = 0
file = File.new("../../shared/assets/xml/nsr_properties.xml",'w')
properties = []
parser.for_tag(:Property).each do |property|
    print '*'
    if property[:PropertyID][:Identification][1][:OrganizationName] == @company_name
        properties << property
        count += 1
    end
    # break if count == 417
end
file.write(properties.to_xml)
file.close
puts "ADDING PROPERTIES TO THE DATABASE"
nsr_properties = File.open("../../shared/assets/xml/nsr_properties.xml")
doc = Nokogiri::XML(nsr_properties)
doc.xpath("//saxerator-builder-hash-elements/saxerator-builder-hash-element").each do |property|
    print '.'
    @images =[]
    property.xpath("File/File").each do |image|
        @images << image.at_xpath("Src/text()").to_s
    end
    @amenities = []
    property.xpath("ILS-Unit/Amenity/Amenity").each do |amenity|
        @amenities << amenity.at_xpath("Description/text()").to_s
    end
    information = {
        "street_address" => property.at_xpath("PropertyID/Address/AddressLine1/text()").to_s,
        "city" => property.at_xpath("PropertyID/Address/City/text()").to_s,
        "zipcode" => property.at_xpath("PropertyID/Address/PostalCode/text()").to_s,
        "short_description" => property.at_xpath("PropertyID/MarketingName/text()").to_s,
        "long_description" => property.at_xpath("Information/LongDescription/text()").to_s,
        "rent" => property.at_xpath("Information/Rents/StandardRent/text()").to_s,
        "application_fee" => property.at_xpath("Fee/ApplicationFee/text()").to_s,
        "bedrooms" => property.at_xpath("ILS-Unit/Units/Unit/UnitBedrooms/text()").to_s,
        "bathrooms" => property.at_xpath("ILS-Unit/Units/Unit/UnitBathrooms/text()").to_s,
        "vacancy_status" => property.at_xpath("ILS-Unit/Availability/VacancyClass/text()").to_s,
        "month_available" => property.at_xpath("ILS-Unit/Availability/MadeReadyDate/@Month").to_s,
        "latitude" => property.at_xpath("ILS-Identification/Latitude/text()").to_s,
        "longitude" => property.at_xpath("ILS-Identification/Longitude/text()").to_s,
        "images" => @images,
        "amenities" => @amenities
    }
    Property.create!(information)
end
puts "DONE, WAHOO"
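
As an alternative to writing the matches back out as XML and re-reading them with Nokogiri, the attribute hash could be built straight from the parsed property. This is only a sketch under assumptions: plain Hashes stand in for Saxerator's output, repeated elements (File, Amenity) are assumed to arrive as an Array and a lone one as a Hash, property_attributes is a hypothetical helper, and the example.com URL is a placeholder.

```ruby
# Hypothetical helper: build the attribute hash for one parsed property.
# [value].flatten.compact turns nil into [], a lone Hash into [hash],
# and leaves an Array of Hashes as it is.
def property_attributes(property)
  files     = [property[:File]].flatten.compact
  amenities = [property.dig(:ILS_Unit, :Amenity)].flatten.compact
  {
    'street_address' => property.dig(:PropertyID, :Address, :AddressLine1).to_s.strip,
    'rent'           => property.dig(:Information, :Rents, :StandardRent).to_s,
    'images'         => files.map { |file| file[:Src] },
    'amenities'      => amenities.map { |amenity| amenity[:Description] }
  }
end

# Assumed shape of one parsed <Property>; the URL is a placeholder.
sample = {
  PropertyID:  { Address: { AddressLine1: '1689 N 4th St ' } },
  Information: { Rents: { StandardRent: '2000.00' } },
  ILS_Unit:    { Amenity: [{ AmenityType: 'Other', Description: 'Ceramic tile' },
                           { AmenityType: 'Other', Description: 'Ceiling fans' }] },
  File:        { FileType: 'Photo', Src: 'http://example.com/medium.jpg' }
}

attributes = property_attributes(sample)
puts attributes.inspect
```

Only four of the fields are shown; the remaining ones would follow the same dig pattern.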

Upvotes: 1

the Tin Man

Reputation: 160631

Try this for a start:

require 'nokogiri'

doc = Nokogiri::XML(File.read('test.xml'))
doc.search('*[OrganizationName="northsteppe"]') 
# => [#<Nokogiri::XML::Element:0x3fd82499131c name="Identification" attributes=[#<Nokogiri::XML::Attr:0x3fd8249912b8 name="IDValue" value="642da00e-9be3-4a7c-bd50-66a4f0d70af8">, #<Nokogiri::XML::Attr:0x3fd8249912a4 name="OrganizationName" value="northsteppe">, #<Nokogiri::XML::Attr:0x3fd824991290 name="IDType" value="property">]>, #<Nokogiri::XML::Element:0x3fd824990a70 name="Identification" attributes=[#<Nokogiri::XML::Attr:0x3fd824990a0c name="IDValue" value="6e1e61523972d5f0e260e3d38eb488337424f21e">, #<Nokogiri::XML::Attr:0x3fd8249909f8 name="OrganizationName" value="northsteppe">, #<Nokogiri::XML::Attr:0x3fd8249909e4 name="IDType" value="Company">]>]

To make what Nokogiri found a bit more readable:

puts doc.search('*[OrganizationName="northsteppe"]').map{ |n| n.to_xml }
# >> <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/>
# >> <Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/>

I find that using CSS is often a lot more readable than XPath. In this case it's a toss-up.
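
For a file small enough to load whole, the matches can also be walked back up to their enclosing Property. The sketch below uses REXML, which ships with Ruby, instead of Nokogiri so that it runs without extra gems; the XPath string itself is standard XPath 1.0, so it should also be usable with Nokogiri's doc.xpath. The CSS selector *[OrganizationName="northsteppe"] corresponds to the XPath //*[@OrganizationName="northsteppe"]; the narrower path used here avoids also matching an Identification nested under ILS_Unit. The PhysicalProperty root element and the second property are made-up sample data.

```ruby
require 'rexml/document'

# Made-up sample: a hypothetical root element wrapping two properties.
xml = <<~XML
  <PhysicalProperty>
    <Property IDValue="a">
      <PropertyID>
        <Identification OrganizationName="northsteppe" IDType="property"/>
        <Identification OrganizationName="northsteppe" IDType="Company"/>
        <Address><City>Columbus</City></Address>
      </PropertyID>
    </Property>
    <Property IDValue="b">
      <PropertyID>
        <Identification OrganizationName="someone-else" IDType="property"/>
        <Address><City>Dayton</City></Address>
      </PropertyID>
    </Property>
  </PhysicalProperty>
XML

doc = REXML::Document.new(xml)

# Select only the Identification tags that sit directly under PropertyID.
path = '//Property/PropertyID/Identification[@OrganizationName="northsteppe"]'
matches = REXML::XPath.match(doc, path)

# Walk up from each match to its enclosing <Property>
# (Identification -> PropertyID -> Property) and drop duplicates.
properties = matches.map { |id| id.parent.parent }.uniq(&:object_id)

cities = properties.map do |property|
  property.elements['PropertyID/Address/City'].text
end
puts cities.inspect
```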


...the actual file is 300MB and loading in the DOM crashes the server.

If your server can't handle the file size, then your best bet is a SAX parser, which is as memory efficient as you can get. Here's a simple example using your sample XML:

require 'nokogiri'

class MyDocument < Nokogiri::XML::SAX::Document
  attr_reader :tags

  def initialize
    super
    @tags = [] # per instance, so separate parses do not share results
  end

  # Called for every opening tag; attributes arrive as [name, value] pairs.
  def start_element name, attributes = []
    attribute_hash = Hash[attributes]
    if (name == 'Identification') && (attribute_hash['OrganizationName'] == 'northsteppe')
      @tags << {
        name: name,
        attributes: attribute_hash
      }
    end
  end
end

doc = MyDocument.new

# Create a new parser
parser = Nokogiri::XML::SAX::Parser.new(doc)

# Feed the parser some XML
parser.parse(File.open('test.xml'))

doc.tags 
# => [{:name=>"Identification",
#      :attributes=>
#       {"IDValue"=>"642da00e-9be3-4a7c-bd50-66a4f0d70af8",
#        "OrganizationName"=>"northsteppe",
#        "IDType"=>"property"}},
#     {:name=>"Identification",
#      :attributes=>
#       {"IDValue"=>"6e1e61523972d5f0e260e3d38eb488337424f21e",
#        "OrganizationName"=>"northsteppe",
#        "IDType"=>"Company"}}]
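
The handler above records only the matching Identification tags. To collect fields from the enclosing Property, the handler has to remember where it is in the document. A minimal sketch of that idea follows, using REXML's stream listener (bundled with Ruby) rather than Nokogiri's SAX classes so that it runs without extra gems; tag_start, text and tag_end play the roles of Nokogiri's start_element, characters and end_element. PropertyListener, the PhysicalProperty root and the second property are assumptions for illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Collects one hash per <Property> whose PropertyID/Identification
# names the wanted organization. Hypothetical class, not from the thread.
class PropertyListener
  include REXML::StreamListener

  # element name under PropertyID/Address => key in the result hash
  ADDRESS_FIELDS = {
    'AddressLine1' => :street_address,
    'City'         => :city,
    'PostalCode'   => :zipcode
  }.freeze

  attr_reader :properties

  def initialize(organization)
    @organization = organization
    @properties = []  # finished, matching properties
    @path = []        # names of the currently open elements
    @current = nil    # fields of the <Property> being read
    @matched = false
  end

  def tag_start(name, attrs)
    @path.push(name)
    if name == 'Property'
      @current = {}
      @matched = false
    elsif @path.last(3) == %w[Property PropertyID Identification]
      @matched ||= attrs['OrganizationName'] == @organization
    end
  end

  def text(value)
    key = ADDRESS_FIELDS[@path.last]
    # Only accept address fields that sit under PropertyID/Address.
    return unless @current && key && @path[-3..-2] == %w[PropertyID Address]
    @current[key] = value.strip
  end

  def tag_end(name)
    @path.pop
    return unless name == 'Property'
    @properties << @current if @matched
    @current = nil
  end
end

# Made-up sample: a hypothetical root element wrapping two properties.
xml = <<~XML
  <PhysicalProperty>
    <Property IDValue="a">
      <PropertyID>
        <Identification OrganizationName="northsteppe" IDType="property"/>
        <Address>
          <AddressLine1>1689 N 4th St </AddressLine1>
          <City>Columbus</City>
          <PostalCode>43201</PostalCode>
        </Address>
      </PropertyID>
    </Property>
    <Property IDValue="b">
      <PropertyID>
        <Identification OrganizationName="someone-else" IDType="property"/>
        <Address><City>Dayton</City></Address>
      </PropertyID>
    </Property>
  </PhysicalProperty>
XML

listener = PropertyListener.new('northsteppe')
REXML::Document.parse_stream(xml, listener)
puts listener.properties.inspect
```

parse_stream also accepts an IO such as an open File, which is what a 300MB feed would call for. Instead of accumulating into properties, tag_end could hand each finished hash to the database right away so that only one property is held at a time.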

Upvotes: 1
