agnitio

Reputation: 137

Nokogiri replace strips content of HTML

I would like to remove certain tags from some HTML while keeping the markup of the remaining content intact. For example, I have a file, test.html:

<p class="P1"><span class="T2">Some text, goes to uppercase</span>
<p class="P4"><span class="T4"> </span><span class="T3">other text</span>
<span class="T5">italics</span><span class="T3">‘more text with UTF-8 ’</span>
</p></p>

I would like to get the following output:

SOME TEXT, GOES TO UPPERCASE
other text
<em>italics</em> ‘more text with UTF-8 ’

My code is:

require 'nokogiri'

f = File.open('raw/test.html',"r")
doc = Nokogiri::XML::DocumentFragment.parse(f.read.encode('UTF-8'))
f.close

doc.css("span.T2").each do |span|
  span.replace span.content.upcase
end
doc.css("span.T5").each do |span|
  span.replace "<em>"+span.content+"</em>"
end
doc.css("span").each do |span|
  span.replace span.content
end
doc.css("p").each do |p|
  p.replace Nokogiri::XML::Text.new(p.inner_html, p.document)
end

f = File.open('processed/test.html',"w")
f.write(doc)
f.close

And the output I get is:

SOME TEXT, GOES TO UPPERCASE
&lt;p class="P4"&gt;
 other text
&lt;em&gt;italics &lt;/em&gt;&amp;#x2018;more text with UTF-8 &amp;#x2019;
&amp;#x2018;our common mother&amp;#x2019;
&lt;/p&gt;

Many thanks in advance.

UPDATE

The solution was as follows:

require 'nokogiri'
require 'htmlentities'

coder = HTMLEntities.new
f = File.open('raw/test.html',"r")
doc = Nokogiri::XML::DocumentFragment.parse(f.read.encode('UTF-8'))
f.close
doc.css("p").each do |p| 
  p.replace p.inner_html 
end 

doc.css("span.T2").each do |span| 
  span.replace span.content.upcase 
end 

doc.css("span.T5").each do |span| 
  span.replace "<em>"+span.content+"</em>" 
end 

doc.css("span").each do |span| 
  span.replace span.inner_html 
end 

f = File.open('processed/test.html',"w") 
f.write(coder.decode(doc)) 
f.close

Upvotes: 3

Views: 2810

Answers (1)

the Tin Man

Reputation: 160551

Using span.replace "<em>"+span.content+"</em>" isn't correct. You need to tell Nokogiri to replace with HTML, not text. For instance:

span.inner_html = "<em>"+span.content+"</em>"

results in:

<span class="T5"><em>italics</em></span>

Upvotes: 1
