lumos

Reputation: 223

Remove nokogiri attribute based on namespace prefix

I'm using Nokogiri to parse an XML file. Some of the nodes in the file have namespace-prefixed attributes:

<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:identifier id="iden" opf:scheme="ISBN">xxxx</dc:identifier>
    <dc:creator opf:role="aut" opf:file-as="Name">xxxx</dc:creator>
    <dc:date opf:event="publication">xxxx</dc:date>
    <dc:publisher>xxxx</dc:publisher>
    <meta name="cover" content="x"/>
</metadata>

I'm trying to remove every attribute that has the "opf" prefix. I've come across XPath solutions for finding an attribute value based on a partial match, but what about a partial match on the attribute name itself? Nothing I've tried so far has worked. As a first step I tried simply printing the names of the matching attributes, but if I do:

elements = @doc.at_xpath('//xmlns:metadata').children
elements.each { |el|
    el.attributes.each { |attribute|
        if attribute[1].namespace_scopes[1].prefix == "opf"
            puts attribute[0]
        end
    }   
}

I end up getting:

id
scheme
role
file-as
event
name
content

but I only want the ones with the "opf" prefix ("opf:scheme", "opf:role", "opf:file-as", "opf:event") so that they can be removed without touching any of the other attributes. I even tried to force it by hard-coding the attributes I knew existed:

opf_attributes = ["opf:file-as","opf:scheme","opf:role","opf:event"]
elements.each  { |el|
    opf_attributes.each { |x|
        el.remove_attribute(x) if el[x] != nil
    }
} 

which is not the smartest way to go about this, and it still didn't work. Nothing happens to the nodes, and the attributes remain as they were. (I don't know if it's worth noting, but if I use the remove_attr(x) method instead, I get this error: undefined method 'remove_attr' for #<Nokogiri::XML::Element:0x...>)

So, my question is:
Is there a clearer way to

  1. find attributes based on a partial match and/or the namespace prefix, then
  2. remove those attributes from the nodes that contain them?

Upvotes: 2

Views: 1345

Answers (2)

Amadan

Reputation: 198436

I believe this is much simpler:

doc.xpath('//@opf:*', { opf: "http://www.idpf.org/2007/opf" }).each(&:remove)

Here // searches every node in the document, @ restricts the match to attribute nodes, opf: together with the namespace definition ({ opf: "http://www.idpf.org/2007/opf" }) says which namespace the attribute has to belong to, and * matches any name.


Note that opf: by itself doesn't mean anything; "http://www.idpf.org/2007/opf" does, and opf is just a shorthand in its scope. .xpath('//@foobar:*', { foobar: "http://www.idpf.org/2007/opf" }) would work just as well for your case.

Since you have the namespace definition on the root, and it doesn't change within the document, you can simplify to

doc.xpath('//@opf:*', doc.namespaces).each(&:remove)

but note that this is not generally safe (e.g. the namespace could be defined on a subnode). doc.collect_namespaces is a bit safer, but even then you are not completely safe (e.g. if the same prefix is used for two different URIs in different parts of the document). I'd go with the first one (explicit URI) unless I had actually looked at the XML myself and knew where and how the prefix is defined and used.

tl;dr: Prefixes mean nothing, refer to the associated URI instead.

Upvotes: 1

Nick Veys

Reputation: 23949

Node objects have a remove method that drops them from the tree, so you can write something like this:

require 'nokogiri'

doc  = Nokogiri::XML(DATA)
puts '--- Before'
puts doc.to_s

doc.traverse do |node|
  next unless node.respond_to? :attributes
  node.attributes.each do |key, val|
    val.remove if val&.namespace&.prefix == 'opf'
  end
end

puts
puts '--- After'
puts doc.to_s

__END__
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:identifier id="iden" opf:scheme="ISBN">xxxx</dc:identifier>
    <dc:creator opf:role="aut" opf:file-as="Name">xxxx</dc:creator>
    <dc:date opf:event="publication">xxxx</dc:date>
    <dc:publisher>xxxx</dc:publisher>
    <meta name="cover" content="x"/>
</metadata>

Running it produces the following output:

➜  ~ ruby test.rb
--- Before
<?xml version="1.0"?>
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:identifier id="iden" opf:scheme="ISBN">xxxx</dc:identifier>
    <dc:creator opf:role="aut" opf:file-as="Name">xxxx</dc:creator>
    <dc:date opf:event="publication">xxxx</dc:date>
    <dc:publisher>xxxx</dc:publisher>
    <meta name="cover" content="x"/>
</metadata>

--- After
<?xml version="1.0"?>
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:identifier id="iden">xxxx</dc:identifier>
    <dc:creator>xxxx</dc:creator>
    <dc:date>xxxx</dc:date>
    <dc:publisher>xxxx</dc:publisher>
    <meta name="cover" content="x"/>
</metadata>

Note: If the Ruby version you are using doesn't support &. (the safe navigation operator, added in Ruby 2.3), you'll need to handle the namespace potentially being nil yourself.

Upvotes: 1
