Reputation: 137
I would like to remove a tag from some HTML without stripping the remaining content of any markup. For example, I have a file, test.html:
<p class="P1"><span class="T2">Some text, goes to uppercase</span>
<p class="P4"><span class="T4"> </span><span class="T3">other text</span>
<span class="T5">italics</span><span class="T3">‘more text with UTF-8 ’</span>
</p></p>
I would like to get the following output:
SOME TEXT, GOES TO UPPERCASE
other text
<em>italics</em> ‘more text with UTF-8 ’
My code is:
require 'nokogiri'

f = File.open('raw/test.html',"r")
doc = Nokogiri::XML::DocumentFragment.parse(f.read.encode('UTF-8'))
f.close
doc.css("span.T2").each do |span|
span.replace span.content.upcase
end
doc.css("span.T5").each do |span|
span.replace "<em>"+span.content+"</em>"
end
doc.css("span").each do |span|
span.replace span.content
end
doc.css("p").each do |p|
p.replace Nokogiri::XML::Text.new(p.inner_html, p.document)
end
f = File.open('processed/test.html',"w")
f.write(doc)
f.close
And the output I get is:
SOME TEXT, GOES TO UPPERCASE
<p class="P4">
other text
<em>italics </em>&#x2018;more text with UTF-8 &#x2019;
&#x2018;our common mother&#x2019;
</p>
Many thanks in advance.
The solution was as follows:
require 'nokogiri'
require 'htmlentities'

coder = HTMLEntities.new
f = File.open('raw/test.html',"r")
doc = Nokogiri::XML::DocumentFragment.parse(f.read.encode('UTF-8'))
f.close
doc.css("p").each do |p|
p.replace p.inner_html
end
doc.css("span.T2").each do |span|
span.replace span.content.upcase
end
doc.css("span.T5").each do |span|
span.replace "<em>"+span.content+"</em>"
end
doc.css("span").each do |span|
span.replace span.inner_html
end
f = File.open('processed/test.html',"w")
f.write(coder.decode(doc))
f.close
Upvotes: 3
Views: 2810
Reputation: 160551
Using span.replace "<em>"+span.content+"</em>" isn't correct. You need to tell Nokogiri to replace with HTML, not text. For instance:
span.inner_html = "<em>"+span.content+"</em>"
results in:
<span class="T5"><em>italics</em></span>
Upvotes: 1