Reputation: 1054
I am pulling entries from an XML file using an Xml::Parser class built on top of Nokogiri::XML::Reader. I would like to grab only the tags where the "Property/PropertyID/Identification['OrganizationName' == 'northsteppe']" but cannot figure out the correct syntax to do so. The entire rake task I've been building is below, followed by a sample node with all of its information and tags. Any guidance would be much appreciated.
================ UPDATE ===============
The file I am parsing is pulled in using open-uri, as it comes from an external source. During development I am working with a hard copy of an older version on my local machine to speed things up, since the file is 300MB+ in size. I tried using a SAX parser, but that logic was a bit too complex for me to really grasp what was going on, and I ran into the same issue: limiting the properties I grab to only those with 'northsteppe' as the OrganizationName in the Identification tag. So I opted to try the same task with the current approach. I am able to grab almost all of the information I need; I am just missing the last piece, which I mentioned above.
=============== GETTING AS SPECIFIC AS POSSIBLE =============
So, I feel as though describing the exact task I am trying to perform will help fill in any gaps. The task is as follows.
Grab every property from the XML file that has OrganizationName = 'northsteppe' in the <Identification> tag, then gather all of the corresponding information for that property and insert it into a hash. Once all of the information for an individual property is in the hash, it needs to be uploaded as an individual entry to the database, which is already built out the way it needs to be. Once that property is inserted into the DB, the rake task moves on to the next Property that has OrganizationName = 'northsteppe' in its <Identification> tag and repeats the process until all matching properties have been inserted. The purpose is to let me run quick searches on just the Northsteppe properties without bogging down the system with every single property in the XML file.
Eventually, I will be using open-uri to pull the file from its external source and run a cron job to execute this rake task once every 6 hours and replace the DB.
================= CODE =================
namespace :db do
# RAKE TASK DESCRIPTION
desc "Fetch property information and insert it into the database"
# RAKE TASK NAME
task :print_properties => :environment do
require 'rubygems'
require 'nokogiri'
module Xml
class Parser
def initialize(node, &block)
@node = node
@node.each do
self.instance_eval &block
end
end
def name
@node.name
end
def inner_xml
@node.inner_xml.strip
end
def is_start?
@node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
end
def is_end?
@node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT
end
def attribute(attribute)
@node.attribute(attribute)
end
def for_element(name, &block)
return unless self.name == name and is_start?
self.instance_eval &block
end
def inside_element(name=nil, &block)
return if @node.self_closing?
return unless name.nil? or (self.name == name and is_start?)
name = @node.name
depth = @node.depth
@node.each do
return if self.name == name and is_end? and @node.depth == depth
self.instance_eval &block
end
end
end
end
Xml::Parser.new(Nokogiri::XML::Reader(open("app/assets/xml/mits.xml"))) do
inside_element 'Property' do
# OPEN AND PARSE THE <PropertyID> TAG
inside_element 'PropertyID' do
inside_element 'Identification' do
puts attribute_nodes()
end
# OPEN AND PARSE THE <Address> TAG
inside_element 'Address' do
for_element 'AddressLine1' do puts "Street Address: #{inner_xml}" end
for_element 'City' do puts "City: #{inner_xml}" end
for_element 'PostalCode' do puts "Zipcode: #{inner_xml}" end
end
for_element 'MarketingName' do puts "Short Description: #{inner_xml}" end
end
# OPEN AND PARSE THE <Information> TAG
inside_element 'Information' do
for_element 'LongDescription' do puts "Long Description: #{inner_xml}" end
inside_element 'Rents' do
for_element 'StandardRent' do puts "Rent: #{inner_xml}" end
end
end
inside_element 'Fee' do
for_element 'ApplicationFee' do puts "Application Fee: #{inner_xml}" end
end
inside_element 'ILS_Identification' do
for_element 'Latitude' do puts "Latitude: #{inner_xml}" end
for_element 'Longitude' do puts "Longitude: #{inner_xml}" end
end
end
end
end #END PRINT_PROPERTIES TASK
end #END NAMESPACE
and a sample of the XML --
<Property IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
<PropertyID>
<Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/>
<Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/>
<MarketingName>Spacious House Central Campus OSU, available fall</MarketingName>
<WebSite>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</WebSite>
<Address AddressType="property">
<Description>Address of Available Listing</Description>
<AddressLine1>1689 N 4th St </AddressLine1>
<City>Columbus</City>
<State>OH</State>
<PostalCode>43201</PostalCode>
<Country>US</Country>
</Address>
<Phone PhoneType="office">
<PhoneNumber>(614) 299-4110</PhoneNumber>
</Phone>
<Email>[email protected]</Email>
</PropertyID>
<ILS_Identification ILS_IdentificationType="Apartment" RentalType="Market Rate">
<Latitude>39.997694</Latitude>
<Longitude>-82.99903</Longitude>
<LastUpdate Month="11" Day="11" Year="2013"/>
</ILS_Identification>
<Information>
<StructureType>Standard</StructureType>
<UnitCount>1</UnitCount>
<ShortDescription>Spacious House Central Campus OSU, available fall</ShortDescription>
<LongDescription>One of our favorites! This great house is perfect for students or a single family. With huge living and sleeping rooms, there is plenty of space. The kitchen is totally modernized with new appliances, and the bathroom has been updated. Natural woodwork and brick accents are seen within the house, and the decorative mantles. Ceiling fans and mini-blinds are included, as well as a FREE stack washer and dryer. The front and side deck. On site parking available.</LongDescription>
<Rents>
<StandardRent>2000.00</StandardRent>
</Rents>
<PropertyAvailabilityURL>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</PropertyAvailabilityURL>
</Information>
<Fee>
<ProrateType>Standard</ProrateType>
<LateType>Standard</LateType>
<LatePercent>0</LatePercent>
<LateMinFee>0</LateMinFee>
<LateFeePerDay>0</LateFeePerDay>
<NonRefundableHoldFee>0</NonRefundableHoldFee>
<AdminFee>0</AdminFee>
<ApplicationFee>30.00</ApplicationFee>
<BrokerFee>0</BrokerFee>
</Fee>
<Deposit DepositType="Security Deposit">
<Amount AmountType="Actual">
<ValueRange Exact="2000.00" Currency="USD"/>
</Amount>
</Deposit>
<Policy>
<Pet Allowed="false"/>
</Policy>
<Phase IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
<Name/>
<Description/>
<UnitCount>1</UnitCount>
<RentableUnits>1</RentableUnits>
<TotalSquareFeet>0</TotalSquareFeet>
<RentableSquareFeet>0</RentableSquareFeet>
</Phase>
<Building IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
<Name/>
<Description/>
<UnitCount>1</UnitCount>
<SquareFeet>0</SquareFeet>
</Building>
<Floorplan IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
<Name/>
<UnitCount>1</UnitCount>
<Room RoomType="Bedroom">
<Count>4</Count>
<Comment/>
</Room>
<Room RoomType="Bathroom">
<Count>1</Count>
<Comment/>
</Room>
<SquareFeet Min="0" Max="0"/>
<MarketRent Min="2000" Max="2000"/>
<EffectiveRent Min="2000" Max="2000"/>
</Floorplan>
<ILS_Unit IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
<Units>
<Unit>
<Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="UL Portfolio"/>
<MarketingName>Spacious House Central Campus OSU, available fall</MarketingName>
<UnitBedrooms>4</UnitBedrooms>
<UnitBathrooms>1.0</UnitBathrooms>
<MinSquareFeet>0</MinSquareFeet>
<MaxSquareFeet>0</MaxSquareFeet>
<SquareFootType>internal</SquareFootType>
<UnitRent>2000.00</UnitRent>
<MarketRent>2000.00</MarketRent>
<Address AddressType="property">
<AddressLine1>1689 N 4th St </AddressLine1>
<City>Columbus</City>
<PostalCode>43201</PostalCode>
<Country>US</Country>
</Address>
</Unit>
</Units>
<Availability>
<VacateDate Month="7" Day="23" Year="2014"/>
<VacancyClass>Unoccupied</VacancyClass>
<MadeReadyDate Month="7" Day="23" Year="2014"/>
</Availability>
<Amenity AmenityType="Other">
<Description>All new stainless steel appliances! Refinished hardwood floors</Description>
</Amenity>
<Amenity AmenityType="Other">
<Description>Ceramic tile</Description>
</Amenity>
<Amenity AmenityType="Other">
<Description>Ceiling fans</Description>
</Amenity>
<Amenity AmenityType="Other">
<Description>Wrap-around porch</Description>
</Amenity>
<Amenity AmenityType="Dryer">
<Description>Free Washer and Dryer</Description>
</Amenity>
<Amenity AmenityType="Washer">
<Description>Free Washer and Dryer</Description>
</Amenity>
<Amenity AmenityType="Other">
<Description>off-street parking available</Description>
</Amenity>
</ILS_Unit>
<File Active="true" FileID="820982141">
<FileType>Photo</FileType>
<Description>Unit Photo</Description>
<Name/>
<Caption/>
<Format>image/jpeg</Format>
<Src>http://pa.cdn.appfolio.com/northsteppe/images/31077069-6e81-4373-8a89-508c57585543/medium.jpg</Src>
<Width>360</Width>
<Height>300</Height>
<Rank>1</Rank>
</File>
<File Active="true" FileID="820982145">
<FileType>Photo</FileType>
<Description>Unit Photo</Description>
<Name/>
<Caption/>
<Format>image/jpeg</Format>
<Src>http://pa.cdn.appfolio.com/northsteppe/images/84e1be40-96fd-4717-b75d-09b39231a762/medium.jpg</Src>
<Width>350</Width>
<Height>265</Height>
<Rank>2</Rank>
</File>
<File Active="true" FileID="820982149">
<FileType>Photo</FileType>
<Description>Unit Photo</Description>
<Name/>
<Caption/>
<Format>image/jpeg</Format>
<Src>http://pa.cdn.appfolio.com/northsteppe/images/cd419635-c37f-4676-a43e-c72671a2a748/medium.jpg</Src>
<Width>350</Width>
<Height>265</Height>
<Rank>3</Rank>
</File>
<File Active="true" FileID="820982152">
<FileType>Photo</FileType>
<Description>Unit Photo</Description>
<Name/>
<Caption/>
<Format>image/jpeg</Format>
<Src>http://pa.cdn.appfolio.com/northsteppe/images/6b68dbd5-2cde-477c-99d7-3ca33f03cce8/medium.jpg</Src>
<Width>350</Width>
<Height>265</Height>
<Rank>4</Rank>
</File>
<File Active="true" FileID="820982155">
<FileType>Photo</FileType>
<Description>Unit Photo</Description>
<Name/>
<Caption/>
<Format>image/jpeg</Format>
<Src>http://pa.cdn.appfolio.com/northsteppe/images/17b6c7c0-686c-4e46-865b-11d80744354a/medium.jpg</Src>
<Width>350</Width>
<Height>265</Height>
<Rank>5</Rank>
</File>
<File Active="true" FileID="820982157">
<FileType>Photo</FileType>
<Description>Unit Photo</Description>
<Name/>
<Caption/>
<Format>image/jpeg</Format>
<Src>http://pa.cdn.appfolio.com/northsteppe/images/3545ac8b-471f-404a-94b2-fcd00dd16e25/medium.jpg</Src>
<Width>350</Width>
<Height>265</Height>
<Rank>6</Rank>
</File>
<File Active="true" FileID="820982160">
<FileType>Photo</FileType>
<Description>Unit Photo</Description>
<Name/>
<Caption/>
<Format>image/jpeg</Format>
<Src>http://pa.cdn.appfolio.com/northsteppe/images/02471172-2183-4bf1-a3d7-33415f902c1c/medium.jpg</Src>
<Width>350</Width>
<Height>265</Height>
<Rank>7</Rank>
</File>
</Property>
Upvotes: 0
Views: 699
Reputation: 1054
So the solution I discovered was a little gem called Saxerator (https://github.com/soulcutter/saxerator). It does SAX parsing without my having to write Nokogiri handlers by hand (thank you), has excellent documentation, and runs super fast. I would encourage anyone who needs a SAX parser in the future to investigate this little gem (pun intended) rather than wading through the Nokogiri documentation, which I found hard to follow. The solution to my problem is below and lives in my seeds.rb file.
require 'saxerator'
parser = Saxerator.parser(File.new("app/assets/xml/mits_snip.xml")) do |config|
config.put_attributes_in_hash!
config.symbolize_keys!
end
parser.for_tag(:Property).each do |property|
if property[:PropertyID][:Identification][1][:OrganizationName] == 'northsteppe'
property_attributes = {
street_address: property[:PropertyID][:Address][:AddressLine1],
city: property[:PropertyID][:Address][:City],
zipcode: property[:PropertyID][:Address][:PostalCode],
short_description: property[:PropertyID][:MarketingName],
long_description: property[:Information][:LongDescription],
rent: property[:Information][:Rents][:StandardRent],
application_fee: property[:Fee][:ApplicationFee],
vacancy_status: property[:ILS_Unit][:Availability][:VacancyClass],
month_available: property[:ILS_Unit][:Availability][:MadeReadyDate][:Month],
latitude: property[:ILS_Identification][:Latitude],
longitude: property[:ILS_Identification][:Longitude]
}
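# create! raises on failure, so an else branch after it could never run;
# create returns the record and persisted? reports whether it was saved
if Property.create(property_attributes).persisted?
puts "wahoo"
else
puts "nope"
end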
end
end
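One caveat with the check above: `[:Identification][1]` assumes every property has at least two Identification elements and that the second one carries the organization name. Below is a minimal sketch of a more forgiving check. It assumes the parser yields nested hashes shaped like the ones used above (symbol keys, attributes merged in); `organization_match?` is a hypothetical helper name, not part of Saxerator.

```ruby
# Returns true when any <Identification> under <PropertyID> carries the
# given OrganizationName. `property` is assumed to be a nested hash with
# symbol keys, shaped like the hashes in the snippet above.
def organization_match?(property, organization)
  identifications = property.dig(:PropertyID, :Identification)
  # A single element is assumed to arrive as a Hash, repeated elements as
  # an Array. Array() is avoided: it would split a Hash into pairs.
  identifications = [identifications] unless identifications.is_a?(Array)
  identifications.compact.any? do |identification|
    identification[:OrganizationName] == organization
  end
end

# Plain hashes standing in for what the parser would yield.
single   = { PropertyID: { Identification: { OrganizationName: 'northsteppe' } } }
repeated = { PropertyID: { Identification: [
  { OrganizationName: 'northsteppe', IDType: 'property' },
  { OrganizationName: 'northsteppe', IDType: 'Company' }
] } }
other    = { PropertyID: { Identification: [{ OrganizationName: 'UL Portfolio' }] } }

matches = [single, repeated, other].select do |property|
  organization_match?(property, 'northsteppe')
end
```

Inside the loop this would replace the indexed comparison, for example `next unless organization_match?(property, 'northsteppe')`.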
============== UPDATE =================
So I actually rewrote this task to work a lot better and just wanted to share it down here in case anyone ever stumbles across this question. This is in my seeds.rb file:
require 'saxerator'
require 'open-uri'
@company_name = 'northsteppe'
parser = Saxerator.parser(File.new("../../shared/assets/xml/mits.xml")) do |config|
config.put_attributes_in_hash!
config.symbolize_keys!
end
puts "DELETED ALL EXISTING PROPERTIES" if Property.delete_all
puts "PULLING RELEVANT XML ENTRIES"
# A local counter; a class variable (@@count) at the top level raises an error in Ruby 3+
count = 0
file = File.new("../../shared/assets/xml/nsr_properties.xml",'w')
properties = []
parser.for_tag(:Property).each do |property|
print '*'
if property[:PropertyID][:Identification][1][:OrganizationName] == @company_name
properties << property
count += 1
end
# break if count == 417
end
file.write(properties.to_xml)
file.close
puts "ADDING PROPERTIES TO THE DATABASE"
nsr_properties = File.open("../../shared/assets/xml/nsr_properties.xml")
doc = Nokogiri::XML(nsr_properties)
doc.xpath("//saxerator-builder-hash-elements/saxerator-builder-hash-element").each do |property|
print '.'
@images =[]
property.xpath("File/File").each do |image|
@images << image.at_xpath("Src/text()").to_s
end
@amenities = []
property.xpath("ILS-Unit/Amenity/Amenity").each do |amenity|
@amenities << amenity.at_xpath("Description/text()").to_s
end
information = {
"street_address" => property.at_xpath("PropertyID/Address/AddressLine1/text()").to_s,
"city" => property.at_xpath("PropertyID/Address/City/text()").to_s,
"zipcode" => property.at_xpath("PropertyID/Address/PostalCode/text()").to_s,
"short_description" => property.at_xpath("PropertyID/MarketingName/text()").to_s,
"long_description" => property.at_xpath("Information/LongDescription/text()").to_s,
"rent" => property.at_xpath("Information/Rents/StandardRent/text()").to_s,
"application_fee" => property.at_xpath("Fee/ApplicationFee/text()").to_s,
"bedrooms" => property.at_xpath("ILS-Unit/Units/Unit/UnitBedrooms/text()").to_s,
"bathrooms" => property.at_xpath("ILS-Unit/Units/Unit/UnitBathrooms/text()").to_s,
"vacancy_status" => property.at_xpath("ILS-Unit/Availability/VacancyClass/text()").to_s,
"month_available" => property.at_xpath("ILS-Unit/Availability/MadeReadyDate/@Month").to_s,
"latitude" => property.at_xpath("ILS-Identification/Latitude/text()").to_s,
"longitude" => property.at_xpath("ILS-Identification/Longitude/text()").to_s,
"images" => @images,
"amenities" => @amenities
}
Property.create!(information)
end
puts "DONE, WAHOO"
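Since the matching properties are already held in memory as hashes, the intermediate XML file and the second Nokogiri pass might be avoidable by reading the fields straight from those hashes. The following is only a rough sketch under that assumption: `wrap` and `property_information` are hypothetical names, the sample hash is hand-written to mimic the parser output, and `dig` would raise if an element expected to be single (such as Address) turned out to be repeated.

```ruby
# Treat a single element (Hash) and repeated elements (Array) alike.
# Array() is not used because it would split a Hash into key/value pairs.
def wrap(value)
  return [] if value.nil?
  value.is_a?(Array) ? value : [value]
end

# Build the attributes for one property from a parser-style hash.
def property_information(property)
  unit = property[:ILS_Unit] || {}
  {
    'street_address'  => property.dig(:PropertyID, :Address, :AddressLine1),
    'city'            => property.dig(:PropertyID, :Address, :City),
    'rent'            => property.dig(:Information, :Rents, :StandardRent),
    'month_available' => unit.dig(:Availability, :MadeReadyDate, :Month),
    'images'          => wrap(property[:File]).map { |file| file[:Src] },
    'amenities'       => wrap(unit[:Amenity]).map { |amenity| amenity[:Description] }
  }
end

# Hand-written stand-in for one parsed <Property>.
sample = {
  PropertyID: { Address: { AddressLine1: '1689 N 4th St', City: 'Columbus' } },
  Information: { Rents: { StandardRent: '2000.00' } },
  ILS_Unit: {
    Availability: { MadeReadyDate: { Month: '7', Day: '23', Year: '2014' } },
    Amenity: [{ Description: 'Ceramic tile' }, { Description: 'Ceiling fans' }]
  },
  File: { Src: 'http://pa.cdn.appfolio.com/northsteppe/images/31077069-6e81-4373-8a89-508c57585543/medium.jpg' }
}

information = property_information(sample)
```

The resulting hash could then be passed to `Property.create!` inside the first loop, so nothing has to be written to disk and parsed again.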
Upvotes: 1
Reputation: 160631
Try this for a start:
require 'nokogiri'
doc = Nokogiri::XML(File.read('test.xml'))
doc.search('*[OrganizationName="northsteppe"]')
# => [#<Nokogiri::XML::Element:0x3fd82499131c name="Identification" attributes=[#<Nokogiri::XML::Attr:0x3fd8249912b8 name="IDValue" value="642da00e-9be3-4a7c-bd50-66a4f0d70af8">, #<Nokogiri::XML::Attr:0x3fd8249912a4 name="OrganizationName" value="northsteppe">, #<Nokogiri::XML::Attr:0x3fd824991290 name="IDType" value="property">]>, #<Nokogiri::XML::Element:0x3fd824990a70 name="Identification" attributes=[#<Nokogiri::XML::Attr:0x3fd824990a0c name="IDValue" value="6e1e61523972d5f0e260e3d38eb488337424f21e">, #<Nokogiri::XML::Attr:0x3fd8249909f8 name="OrganizationName" value="northsteppe">, #<Nokogiri::XML::Attr:0x3fd8249909e4 name="IDType" value="Company">]>]
To make what Nokogiri found a bit more readable:
puts doc.search('*[OrganizationName="northsteppe"]').map{ |n| n.to_xml }
# >> <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/>
# >> <Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/>
I find that using CSS is often a lot more readable than XPath. In this case it's a toss-up.
...the actual file is 300MB and loading in the DOM crashes the server.
If your server can't handle the file size, then your best bet is a SAX parser, which is as memory efficient as you can get. Here's a simple example using your sample XML:
require 'nokogiri'
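class MyDocument < Nokogiri::XML::SAX::Document
# Matches are kept per document instance rather than in a class variable,
# so parsing a second file does not append to the first file's results
attr_reader :tags
def initialize
super
@tags = []
end
def start_element name, attributes = []
attribute_hash = Hash[attributes]
if (name == 'Identification') && (attribute_hash['OrganizationName'] == 'northsteppe')
@tags << {
name: name,
attributes: attribute_hash
}
end
end
end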
doc = MyDocument.new
# Create a new parser
parser = Nokogiri::XML::SAX::Parser.new(doc)
# Feed the parser some XML
parser.parse(File.open('test.xml'))
doc.tags
# => [{:name=>"Identification",
# :attributes=>
# {"IDValue"=>"642da00e-9be3-4a7c-bd50-66a4f0d70af8",
# "OrganizationName"=>"northsteppe",
# "IDType"=>"property"}},
# {:name=>"Identification",
# :attributes=>
# {"IDValue"=>"6e1e61523972d5f0e260e3d38eb488337424f21e",
# "OrganizationName"=>"northsteppe",
# "IDType"=>"Company"}}]
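The handler above only records the matching Identification tags. To gather the fields of each matching property as well, the handler has to remember where it is in the document. Below is a rough sketch of that bookkeeping. To keep it runnable without extra gems it uses REXML's stream parser, which ships with Ruby, rather than Nokogiri; `tag_start`, `text` and `tag_end` play the role of Nokogiri's `start_element`, `characters` and `end_element`. `PropertyCollector`, the `FIELDS` mapping and the `<Properties>` wrapper element are made up for illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Collects one hash per <Property> whose <PropertyID> contains an
# <Identification> with the wanted OrganizationName.
class PropertyCollector
  include REXML::StreamListener

  # Leaf elements to capture, keyed by their path below <Property>.
  FIELDS = {
    'PropertyID/Address/AddressLine1' => :street_address,
    'PropertyID/Address/City'         => :city,
    'PropertyID/Address/PostalCode'   => :zipcode,
    'Information/Rents/StandardRent'  => :rent
  }.freeze

  attr_reader :properties

  def initialize(organization)
    @organization = organization
    @properties = []
    @path = nil # element names below the current <Property>; nil outside one
  end

  def tag_start(name, attrs)
    if @path.nil?
      return unless name == 'Property'
      @path = []
      @current = {}
      @matched = false
    else
      @path << name
      # Only an Identification directly under PropertyID counts, not the
      # one nested inside ILS_Unit/Units/Unit.
      @matched = true if @path == %w[PropertyID Identification] &&
                         attrs.to_h['OrganizationName'] == @organization
    end
  end

  def text(value)
    return if @path.nil?
    key = FIELDS[@path.join('/')]
    # Text may arrive in several chunks, so append instead of overwriting.
    @current[key] = @current.fetch(key, '') + value if key
  end

  def tag_end(_name)
    return if @path.nil?
    if @path.empty?
      # This is the closing </Property>: keep the hash only if it matched.
      @properties << @current.transform_values(&:strip) if @matched
      @path = nil
    else
      @path.pop
    end
  end
end

# A cut-down sample; the <Properties> wrapper name is an assumption.
xml = <<~XML
  <Properties>
    <Property>
      <PropertyID>
        <Identification OrganizationName="northsteppe" IDType="property"/>
        <Address>
          <AddressLine1>1689 N 4th St </AddressLine1>
          <City>Columbus</City>
          <PostalCode>43201</PostalCode>
        </Address>
      </PropertyID>
      <Information><Rents><StandardRent>2000.00</StandardRent></Rents></Information>
    </Property>
    <Property>
      <PropertyID><Identification OrganizationName="someone-else"/></PropertyID>
      <ILS_Unit><Units><Unit><Identification OrganizationName="northsteppe"/></Unit></Units></ILS_Unit>
    </Property>
  </Properties>
XML

collector = PropertyCollector.new('northsteppe')
REXML::Document.parse_stream(xml, collector)
```

Only one property's worth of data is held at a time, so memory use should stay flat regardless of file size. In a real run each finished hash could be written to the database in `tag_end` instead of being appended to `properties`.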
Upvotes: 1