Reputation: 55
I have a problem and need a fast solution.
I want to remove <br> and <p> tags inside all tables, but not outside them.
For example, the initial HTML document:
...
<p>Hello</p>
<table>
<tr>
<td><p>Text example <br>continues...</p></td>
<td><p>Text example <br>continues...</p></td>
<td><p>Text example <br>continues...</p></td>
<td><p>Text example <br>continues...</p></td>
</tr>
</table>
<p>Bye<br></p>
<p>Bye<br></p>
...
My objective:
...
<p>Hello</p>
<table>
<tr>
<td>Text example continues...</td>
<td>Text example continues...</td>
<td>Text example continues...</td>
<td>Text example continues...</td>
</tr>
</table>
<p>Bye<br></p>
<p>Bye<br></p>
...
This is my current method for cleaning it:
loop do
  if html.match(/<table>(.*?)(<\/?(p|br)*?>)(.*?)<\/table>/) != nil
    html = html.gsub(/<table>(.*?)(<\/?(p|br)*?>)(.*?)<\/table>/,'<table>\1 \4</table>')
  else
    break
  end
end
That works, but the problem is that I have 1xxx documents of about 1000 lines each, and each one takes 1-3 hours. (1-3 hours) * (thousands of documents) = pain!
I've been trying to do it with Sanitize or some other method, but so far I haven't found a way.
Can anybody help me?
Thank you in advance! Manu
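One likely reason the loop above is slow is that every iteration rescans the whole document and strips only one tag per table. A minimal sketch of a single-pass alternative using only the standard library follows. It assumes tables are not nested and that the markup is regular enough for a regex. The names TABLE_RE, P_BR_RE and strip_p_br_in_tables are made up for this illustration. Unlike the loop, it replaces each tag with nothing rather than a space, which matches the objective shown above.

```ruby
# Matches one whole table, from <table ...> to the nearest </table>.
# The /m flag lets "." cross newlines; nested tables are NOT handled (assumption).
TABLE_RE = /<table\b[^>]*>.*?<\/table>/mi

# Matches opening and closing <p> and <br> tags, with optional attributes.
P_BR_RE = /<\/?(?:p|br)\b[^>]*>/i

# Hypothetical helper: strips <p> and <br> tags inside tables only.
def strip_p_br_in_tables(html)
  # The outer gsub visits each table once; the inner gsub cleans that table.
  html.gsub(TABLE_RE) { |table| table.gsub(P_BR_RE, '') }
end

html = "<p>Hello</p>\n<table>\n<tr>\n" \
       "<td><p>Text example <br>continues...</p></td>\n" \
       "</tr>\n</table>\n<p>Bye<br></p>\n"

puts strip_p_br_in_tables(html)
```

Each table is scanned once, so the cost should grow roughly with document size instead of with the number of tags times the document size.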
Upvotes: 2
Views: 147
Reputation: 118261
Using Nokogiri:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-_HTML_
<p>Hello</p>
<table>
<tr>
<td><p>Text example <br>continues...</p></td>
<td><p>Text example <br>continues...</p></td>
<td><p>Text example <br>continues...</p></td>
<td><p>Text example <br>continues...</p></td>
</tr>
</table>
<p>Bye<br></p>
<p>Bye<br></p>
_HTML_
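# Replace each <p> that sits directly in a table cell with its plain text.
# Taking el.text also drops the <br> nested inside the paragraph.
doc.xpath("//table/tr/td/p").each do |el|
  el.replace(el.text)
end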
puts doc.to_html
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>Hello</p>
<table><tr>
<td>Text example continues...</td>
<td>Text example continues...</td>
<td>Text example continues...</td>
<td>Text example continues...</td>
</tr></table>
<p>Bye<br></p>
<p>Bye<br></p>
</body>
</html>
Upvotes: 4