Reputation: 45
I need to find the best way to gather writer and artist information from the following XML data. The comic node appears multiple times, and each one holds the data for a single comic book.
I can't grab the appropriate person according to their job function (writer, artist, etc.). A comic book sometimes has multiple writers and artists, so my plan is to append each one to a list.
For this single comic book, I need to get the display names of all the writers and artists, but the job function (e.g. writer) is a sibling of the person node, not its parent.
Here is what I have, but it doesn't work:
writer = []
penciler = []
doc.xpath('//comic').each do |main_element|
  main_element.xpath("mainsection/credits/credit/role[@id='dfWriter']").each do |n|
    writer << n.xpath('person/displayname').text
  end
  main_element.xpath("mainsection/credits/credit/role[@id='dfPenciler']").each do |n|
    penciler << n.xpath('person/displayname').text
  end
end
p "Writer(s): ",writer
p "Penciler(s): ",penciler
This is the XML file/data:
<comic>
  <id>3398</id>
  <index>195</index>
  <mainsection>
    <title>Mind Games</title>
    <myrating>0</myrating>
    <myrating>
      <displayname>0</displayname>
      <sortname>0</sortname>
    </myrating>
    <pagecount>32</pagecount>
    <credits>
      <credit>
        <role id="dfWriter">Writer</role>
        <roleid>dfWriter</roleid>
        <person>
          <displayname>Will Pfeifer</displayname>
          <sortname>Pfeifer, Will</sortname>
          <lastname>Pfeifer</lastname>
          <firstname>Will</firstname>
        </person>
      </credit>
      <credit>
        <role id="dfWriter">Writer</role>
        <roleid>dfWriter</roleid>
        <person>
          <displayname>John Byrne</displayname>
          <sortname>Byrne, John</sortname>
          <lastname>Byrne</lastname>
          <firstname>John</firstname>
        </person>
      </credit>
      <credit>
        <role id="dfPenciler">Penciller</role>
        <roleid>dfPenciler</roleid>
        <person>
          <displayname>John Byrne</displayname>
          <sortname>Byrne, John</sortname>
          <lastname>Byrne</lastname>
          <firstname>John</firstname>
        </person>
      </credit>
    </credits>
  </mainsection>
</comic>
The code I have does not give me the desired results. I found "Getting the siblings of a node with Nokogiri", but I need to iterate and grab each sibling.
I can search by either <roleid>dfWriter</roleid> or <role id="dfWriter">Writer</role>, as they carry the same value.
My expected output would be:
Writer(s): Will Pfeifer, John Byrne
Penciler(s): John Byrne
Upvotes: 2
Views: 445
Reputation: 160631
Here's how I'd go about doing this:
require 'nokogiri'
XML = <<EOT
<comic>
  <mainsection>
    <credits>
      <credit>
        <role id="dfWriter">Writer</role>
        <person>
          <displayname>Will Pfeifer</displayname>
        </person>
      </credit>
      <credit>
        <role id="dfWriter">Writer</role>
        <person>
          <displayname>John Byrne</displayname>
        </person>
      </credit>
      <credit>
        <role id="dfPenciler">Penciller</role>
        <person>
          <displayname>John Byrne</displayname>
        </person>
      </credit>
    </credits>
  </mainsection>
</comic>
EOT
doc = Nokogiri::XML(XML)
writers = doc.search("credits role[id='dfWriter']").map { |w| w.parent.at('displayname').text }
pencilers = doc.search("credits role[id='dfPenciler']").map { |n| n.parent.at('displayname').text }
puts "Writer(s): %s" % writers.join(', ')
puts "Penciler(s): %s" % pencilers.join(', ')
# >> Writer(s): Will Pfeifer, John Byrne
# >> Penciler(s): John Byrne
Which, when run, outputs:
# >> Writer(s): Will Pfeifer, John Byrne
# >> Penciler(s): John Byrne
This:
writers = doc.search("credits role[id='dfWriter']").map { |w| w.parent.at('displayname').text }
pencilers = doc.search("credits role[id='dfPenciler']").map { |n| n.parent.at('displayname').text }
could be DRY'd to:
writers, pencilers = %w(dfWriter dfPenciler).map { |s|
  doc.search("credits role[id='#{s}']").map { |w| w.parent.at('displayname').text }
}
I used CSS selectors for readability, and I used the at method, which returns a Node, instead of xpath, which returns a NodeSet, because I want the text of a single node.
The distinction between calling text on a NodeSet versus a Node is extremely important. Consider this:
require 'nokogiri'
xml = <<EOT
<root>
<displayname>Will Pfeifer</displayname>
<displayname>John Byrne</displayname>
<displayname>John Byrne</displayname>
</root>
EOT
doc = Nokogiri::XML(xml)
doc.search('displayname').class # => Nokogiri::XML::NodeSet
doc.search('displayname').text # => "Will PfeiferJohn ByrneJohn Byrne"
doc.at('displayname').class # => Nokogiri::XML::Element
doc.at('displayname').text # => "Will Pfeifer"
If you want all the text for a NodeSet in an easily usable form, then extract it from each node:
doc.search('displayname').map(&:text) # => ["Will Pfeifer", "John Byrne", "John Byrne"]
Upvotes: 0
Reputation: 89335
You can use the XPath following-sibling axis for this purpose, assuming the target element is always located after role:
doc.xpath('//comic').each do |main_element|
  main_element.xpath("mainsection/credits/credit/role[@id='dfWriter']").each do |n|
    writer << n.xpath('following-sibling::person/displayname').text
  end
  main_element.xpath("mainsection/credits/credit/role[@id='dfPenciler']").each do |n|
    penciler << n.xpath('following-sibling::person/displayname').text
  end
end
Or you can just iterate through credit instead of role in the first place:
doc.xpath('//comic').each do |main_element|
  main_element.xpath("mainsection/credits/credit[role/@id='dfWriter']").each do |n|
    writer << n.xpath('person/displayname').text
  end
  main_element.xpath("mainsection/credits/credit[role/@id='dfPenciler']").each do |n|
    penciler << n.xpath('person/displayname').text
  end
end
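Since the comic node repeats, it may be useful to keep the results per comic rather than in two shared arrays. The following is only a sketch under that assumption: the wrapping comics element, the second comic with its title and creator name, and the names_for helper are all invented for illustration.

```ruby
require 'nokogiri'

# Hypothetical two-comic document; the second comic is invented for illustration.
xml = <<~EOT
  <comics>
    <comic>
      <mainsection>
        <title>Mind Games</title>
        <credits>
          <credit><role id="dfWriter">Writer</role><person><displayname>Will Pfeifer</displayname></person></credit>
          <credit><role id="dfWriter">Writer</role><person><displayname>John Byrne</displayname></person></credit>
          <credit><role id="dfPenciler">Penciller</role><person><displayname>John Byrne</displayname></person></credit>
        </credits>
      </mainsection>
    </comic>
    <comic>
      <mainsection>
        <title>Example Issue</title>
        <credits>
          <credit><role id="dfWriter">Writer</role><person><displayname>Example Writer</displayname></person></credit>
        </credits>
      </mainsection>
    </comic>
  </comics>
EOT

doc = Nokogiri::XML(xml)

# Hypothetical helper: display names of everyone credited with role_id in one comic.
def names_for(comic, role_id)
  comic.xpath("mainsection/credits/credit[role/@id='#{role_id}']/person/displayname")
       .map(&:text)
end

# One hash per comic, so credits from different comics are not mixed together.
results = doc.xpath('//comic').map do |comic|
  {
    title: comic.at_xpath('mainsection/title').text,
    writers: names_for(comic, 'dfWriter'),
    pencilers: names_for(comic, 'dfPenciler')
  }
end

results.each do |r|
  puts r[:title]
  puts "Writer(s): #{r[:writers].join(', ')}"
  puts "Penciler(s): #{r[:pencilers].join(', ')}"
end
```

Because roleid carries the same value as the id attribute in the question's data, the predicate could presumably test it instead, as in credit[roleid='dfWriter'].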
Upvotes: 1