Reputation: 4767
I have HTML with nested repeated tags:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div>
<div>
<div>
<p>Some text</p>
</div>
</div>
</div>
</body>
</html>
I want to remove the nested repeated divs which don't have any attributes. The resulting HTML should look like:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div>
<p>Some text</p>
</div>
</body>
</html>
How can that be done using Nokogiri or pure Ruby?
Upvotes: 3
Views: 386
Reputation: 3789
Normally I'm not a huge fan of mutable structures like the ones Nokogiri uses, but in this case I think it works to your advantage. Something like this might work:
def recurse(node)
  # depth first so we don't accidentally modify a collection while
  # we're iterating through it.
  node.elements.each do |child|
    recurse(child)
  end
  # replace this element's children with its grandchildren,
  # assuming it meets all the criteria
  if merge_candidate?(node)
    node.children = node.elements.first.children
  end
end

# true when node is an attribute-less <div> whose only child element
# is another attribute-less <div>
def merge_candidate?(node, name: 'div')
  return false unless node.element?
  return false unless node.attributes.empty?
  return false unless node.name == name
  return false unless node.elements.length == 1
  return false unless node.elements.first.name == name
  return false unless node.elements.first.attributes.empty?
  true
end
[18] pry(main)> file = File.read('test.html')
[19] pry(main)> doc = Nokogiri.parse(file)
[20] pry(main)> puts doc
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div>
<div>
<div>
<p>Some text</p>
</div>
</div>
</div>
</body>
</html>
[21] pry(main)> recurse(doc)
[22] pry(main)> puts doc
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div>
<p>Some text</p>
</div>
</body>
</html>
=> nil
[23] pry(main)>
Upvotes: 2
Reputation: 160551
Based on how your HTML is structured this should get you going:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div>
<div>
<div>
<p>Some text</p>
</div>
</div>
</div>
</body>
</html>
EOT
dd = doc.at('div div').parent
dp = dd.at('div p')
dd.children.unlink
dp.parent = dd
Which results in:
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <head>
# >> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
# >> </head>
# >> <body>
# >> <div><p>Some text</p></div>
# >> </body>
# >> </html>
dd is the parent of two successive div tags; in other words, it's the first div in the chain.
dp is the p node at the end of that chain.
dd.children is a NodeSet containing the direct children of dd. Unlinking them detaches everything beneath dd, including the intervening div tags and dp.
The idea is to graft dp (the desired <p> node) onto dd (the topmost <div> node) after removing all the intervening <div> tags. A NodeSet makes it easy to unlink large numbers of tags at once.
Read about Nokogiri's at method to understand why it's significant for this sort of problem.
Upvotes: 1