NinjaCoder

Reputation: 23

How to get first level children for XML using Nokogiri

I am trying to parse a POM file using Nokogiri, and want to get the first level child nodes.

My POM file looks something like this:

<project xmlns="some.maven.link">
   <parent>
     <groupId>parent.jar</groupId>
     <artifactId>parent-jar</artifactId>
   </parent>
   <groupId>child.jar</groupId>
   <artifactId>child-jar</artifactId>
</project>

I am trying to fetch the artifactId "child-jar", but the XPath I am using is possibly incorrect: it fetches "parent.jar" as the first occurrence.

This is my Ruby code:

@pom = Nokogiri::XML(File.open(file_path))
p @pom.xpath("/project/artifactId", "project"=>"http://maven.apache.org/POM/4.0.0")[0].text

I can access the second element instead, but that would just be a hack.

Upvotes: 2

Views: 1819

Answers (1)

the Tin Man

Reputation: 160551

Your XML sample does not appear to be correct. Simplifying it:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<project>
  <parent>
    <groupId>parent.jar</groupId>
    <artifactId>parent-jar</artifactId>  
  </parent>         
  <groupId>child.jar</groupId>
  <artifactId>child-jar</artifactId>
</project>
EOT

doc.at('project > artifactId').text # => "child-jar"

Using XPath I'd use:

doc.at('/project/artifactId').text # => "child-jar"

I'd suggest learning the difference between search, xpath, css and their at* cousins, which are all documented in the "Searching a XML/HTML Document" tutorial and the Node documentation.

In the simplified example at the top of this answer, I removed the XML namespace information to keep things simple. XML namespaces are useful but can also be irritating, and your example XML was broken because it didn't supply a valid URL. Fixing the example with:

<project xmlns="http://www.w3.org/1999/xhtml">

I can use:

namespaces = doc.collect_namespaces  # => {"xmlns"=>"http://www.w3.org/1999/xhtml"}
doc.at('project > artifactId', namespaces).text # => "child-jar"

or:

doc.at('xmlns|project > xmlns|artifactId').text # => "child-jar"

I prefer and recommend the first because it's more readable and less noisy. Nokogiri's CSS implementation helps simplify most selectors, and passing in the namespaces collected from the document simplifies searches whether you're using CSS or XPath.

These also work:

doc.at('/xmlns:project/xmlns:artifactId').text # => "child-jar"
doc.at('/foo:project/foo:artifactId', {'foo' => "http://www.w3.org/1999/xhtml"}).text # => "child-jar"

Note that the second uses a renamed namespace prefix, which is useful if the document contains redundant xmlns declarations and you need to differentiate between them.

Nokogiri's "Namespaces" tutorial is helpful.

Upvotes: 3
