How to remove repeated nested tags using Nokogiri

Question

I have HTML with nested repeated tags:

I want to remove nested repeated divs which don't have any attributes. The resulting HTML should look like:


  
    
  
  
    
      Some text

How can that be done using Nokogiri or pure Ruby?

GSP · Accepted Answer

Normally I'm not a huge fan of mutable structures like the ones Nokogiri uses, but in this case I think it works to your advantage. Something like this might work:

require 'nokogiri'

def recurse(node)
  # depth first so we don't accidentally modify a collection while
  # we're iterating through it.
  node.elements.each do |child|
    recurse(child)
  end

  # replace this element's children with its grandchildren,
  # assuming it meets all the criteria. Note that this also drops any
  # text nodes sitting next to the single child element.
  if merge_candidate?(node)
    node.children = node.elements.first.children
  end
end

# True when node is an attribute-free <name> element whose only element
# child is also an attribute-free <name>. Text children are not counted.
def merge_candidate?(node, name: 'div')
  return false unless node.element?
  return false unless node.attributes.empty?
  return false unless node.name == name
  return false unless node.elements.length == 1
  return false unless node.elements.first.name == name
  return false unless node.elements.first.attributes.empty?

  true
end

[18] pry(main)> file = File.read('test.html')
[19] pry(main)> doc = Nokogiri.parse(file)
[20] pry(main)> puts doc


  
    
  
  
    
      
        
          Some text
          
      
    
  

[21] pry(main)> recurse(doc)
[22] pry(main)> puts doc


  
    
  
  
    
      Some text
    
  

=> nil
[23] pry(main)>
