Reputation: 125
I am trying to parse multiple XML files with Nokogiri. They are in the following format:
<?xml version="1.0" encoding="UTF-8"?>
<CRDoc>[Congressional Record Volume<volume>141</volume>, Number<number>213</number>(<weekday>Sunday</weekday>,<month>December</month>
<day>31</day>,<year>1995</year>)]
[<chamber>Senate</chamber>]
[Page<pages>S19323</pages>]<congress>104</congress>
<session>1</session>
<document_title>UNANIMOUS-CONSENT REQUEST--HOUSE MESSAGE ON S. 1508</document_title>
<speaker name="Mr. DASCHLE">Mr. DASCHLE</speaker>.<speaking name="Mr. DASCHLE">Mr. President, I said this on the floor yesterday
afternoon, and I will repeat it this afternoon. I know that the
distinguished majority leader wants an agreement as much as I do, and I
do not hold him personally responsible for the fact that we are not
able to overcome this impasse. I commend him for his efforts at trying
to do so again today.</speaking>
<speaking name="Mr. DASCHLE">Let me try one other option. We have already been unable to agree to
a continuing resolution that would have put all Federal employees back
to work with pay. We have been unable to agree to something that we
agreed to last Friday, the 22d of December, which would have at least
sent them back to their offices without pay. Perhaps we can try this.</speaking>
<speaking name="Mr. DASCHLE">I ask unanimous consent that the Senate proceed to the message from
the House on S. 1508, that the Senate concur in the House amendment
with a substitute amendment that includes the text of Senator Dole's
back-to-work bill, and the House-passed expedited procedures shall take
effect only if the budget agreement does not cut Medicare more than
necessary to ensure the solvency of the Medicare part A trust fund and,
second, does not raise taxes on working Americans, does not cut funding
for education or environmental enforcement, and maintains the
individual health guarantee under Medicaid and, third, provides that
any tax reductions in the budget agreement go only to Americans making
under $100,000; that the motion to concur be agreed to, and the motion
to reconsider be laid upon the table.</speaking>
<speaker name="The ACTING PRESIDENT pro tempore">The ACTING PRESIDENT pro tempore</speaker>.<speaking name="The ACTING PRESIDENT pro tempore">Is there objection?</speaking>
<speaker name="Mr. DOLE">Mr. DOLE</speaker>.<speaking name="Mr. DOLE">Mr. President, I want to say a few words. But I will
object.</speaking>
<speaking name="Mr. DOLE">We are working on a lot of these things in our meetings at the White
House, where we have both been for a number of hours. I think we have
made some progress. We are a long way from any solution yet.</speaking>
<speaking name="Mr. DOLE">I think all of the things listed by the Democratic leader are areas
of concern in the meetings we have had. And the meetings will start
again on Tuesday. But it seems to me that it would not be appropriate
to proceed under those terms, and therefore I object.</speaking>
<speaker name="The ACTING PRESIDENT pro tempore">The ACTING PRESIDENT pro tempore</speaker>.<speaking name="The ACTING PRESIDENT pro tempore">Objection is heard.</speaking>
</CRDoc>
The code I am using came from previous help and has worked a treat so far. However, the format of the XML files has changed and left the code unusable. The code I have is this:
doc.xpath("//speech/speaking/@name").map(&:text).uniq.each do |name|
  speaker = Nokogiri::XML('<root/>')
  doc.xpath('//speech').each do |speech|
    speech_node = Nokogiri::XML('<speech/>')
    speech.xpath("*[@name='#{name}']").each do |speaking|
      speech_node.root.add_child(speaking)
    end
    speaker.root.add_child(speech_node.root) unless speech_node.root.children.empty?
  end
  File.open("test/" + name + "-" + year + ".xml", 'a+') do |f|
    f.write speaker.root.children
  end
end
I would like to create a new XML file for each speaker and in each new XML file have what they said. The code needs to be able to cycle through the various XML files in a directory and place each speech in the appropriate speaker file. I was thinking this could be accomplished with a find -exec command.
Ultimately, the code should:
Mr. Boehner_2011.xml
CRDoc
root node.
Upvotes: 0
Views: 753
Reputation: 37517
My suggestion is, rather than continuing to use code you don't understand, break it down into bits so that it is easier to understand, or at least it will be easier to isolate a problem.
Imagine being able to do this:
crdoc = CongressionalRecordDocument.new(filename)
crdoc.year
#=> 1995
crdoc.speakers
#=> ["Mr. DASCHLE", "The ACTING PRESIDENT pro tempore", "Mr. DOLE"]
crdoc.speakers.each do |speaker|
  speech = crdoc.speaking_parts(speaker)
  # save speech to file
end
This hides the details, making it much easier to read. Better yet, it would compartmentalize them so if the way you retrieve the speaker list changes, for example, you only have to change one small part, and that part will be easy to test. Let's implement it:
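require 'nokogiri'

class CongressionalRecordDocument
  def initialize(xml_file)
    # Nokogiri::XML expects XML content (a string or IO), not a path,
    # so read the file before parsing.
    @doc = Nokogiri::XML(File.read(xml_file))
  end

  def year
    # Convert the element's text so this returns 1995, not a Nokogiri node.
    @year ||= @doc.at('//year').text.to_i
  end

  def speakers
    @speakers ||= @doc.xpath('//speaker/@name').map(&:text).uniq
  end

  def speaking_parts(speaker)
    # Note: a name containing an apostrophe would break this XPath literal.
    @doc.xpath("//speaking[@name = '#{speaker}']").map(&:text)
  end
end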
Looks a lot less complex now, doesn't it? You may also want to create a class for your new document in a similar way, so that creating your output is just as simple.
Also, instead of a find -exec, you may want to find your files in Ruby:
Dir["/path/to/search/*.xml"].each do |file|
  crdoc = CongressionalRecordDocument.new(file)
  # etc
end
Upvotes: 4
Reputation: 37409
Since you don't have the <speech> element anymore, you need to remove it from your code:
doc.xpath("//speaking/@name").map(&:text).uniq.each do |name|
  speaker = Nokogiri::XML('<root/>')
  doc.xpath('//CRDoc').each do |speech|
    speech_node = Nokogiri::XML('<speech/>')
    speech.xpath("*[@name='#{name}']").each do |speaking|
      speech_node.root.add_child(speaking)
    end
    speaker.root.add_child(speech_node.root) unless speech_node.root.children.empty?
  end
  File.open("test/" + name + "-" + year + ".xml", 'a+') do |f|
    f.write speaker.root.children
  end
end
Upvotes: 1