Reputation: 27812
I have some HTML tables inside of an HTML cell, like so:
miniTable = '''<table style="width: 100%%" bgcolor="%s">
<tr><td><font color="%s"><b>%s</b></td></tr>
</table>''' % (bgcolor, fontColor, floatNumber)
html += '<td>' + miniTable + '</td>'
Is there a way to remove the HTML tags that pertain to this minitable, and only these html tags?
I would like to somehow remove these tags:
<table style="width: 100%%" bgcolor="%s"><tr><td><font color="%s"><b>
and
</b></td></tr></table>
to get this:
floatNumber
where floatNumber is the string representation of a floating-point number. I don't want any of the other HTML tags to be modified in any way. I was thinking of using string.replace or a regex, but I'm stumped.
Upvotes: 1
Views: 7540
Reputation: 2049
If you can't install and use Beautiful Soup (otherwise BS is preferred, as @otto-allmendinger proposed):
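import re
s = '<table style="width: 100%%" bgcolor="%s"><tr><td><font color="%s"><b>1.23</b></td></tr></table>'
# Strip only the mini-table's tags; "/?" matches the optional slash of a closing tag
# (the looser ".?" would also match unrelated tags such as <ab>).
result = float(re.sub(r"</?table[^>]*>|</?t[rd]>|<font[^>]+>|</?b>", "", s))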
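The snippet above works on the mini-table string in isolation. If the mini-table is already embedded in a larger document, a hedged variation is to match the whole mini-table template and keep only the captured number, so that other tags (including the outer table and any unrelated <b> tags) are left alone. This is a minimal sketch that assumes the mini-table markup is exactly the template from the question; the names MINITABLE and strip_minitables and the sample colors are made up for illustration.

```python
import re

# Matches exactly the mini-table template from the question and captures
# the text between <b> and </b>. Assumes the markup is not altered elsewhere.
MINITABLE = re.compile(
    r'<table style="width: 100%" bgcolor="[^"]*">\s*'
    r'<tr><td><font color="[^"]*"><b>([^<]*)</b></td></tr>\s*'
    r'</table>'
)

def strip_minitables(html):
    """Replace every mini-table with the text it wraps (hypothetical helper)."""
    return MINITABLE.sub(r"\1", html)

# Build a sample document the same way the question does.
mini = ('<table style="width: 100%%" bgcolor="%s">\n'
        '<tr><td><font color="%s"><b>%s</b></td></tr>\n'
        '</table>') % ("#ff0000", "#ffffff", 1.23)
html = '<table><tr><td>' + mini + '</td><td><b>keep</b></td></tr></table>'

cleaned = strip_minitables(html)
print(cleaned)  # <table><tr><td>1.23</td><td><b>keep</b></td></tr></table>
```

Because the pattern is anchored on the style and bgcolor attributes, the outer <table> and the unrelated <b>keep</b> cell are not touched. If the template ever changes, the pattern has to change with it, which is the usual fragility of regex-based HTML editing.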
Upvotes: 2
Reputation: 28288
Do not use str.replace or regex.
Use an HTML parsing library like Beautiful Soup, select the element you want, and read its contained text.
The final code should look something like this:
from bs4 import BeautifulSoup

# html_doc is the HTML string to parse; naming the parser avoids a warning
soup = BeautifulSoup(html_doc, "html.parser")
for t in soup.find_all("table"):  # the actual selection depends on your specific code
    content = t.get_text()
    # content should be the float number
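To go one step further and put the text back in place of the mini-table while leaving the rest of the document alone, something along these lines might work. It is only a sketch: it assumes every mini-table is a direct child of a <td>, that the sample markup (with made-up colors and a closed <font> tag) resembles the real page, and that the stdlib-backed "html.parser" is acceptable.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: an outer table whose first cell holds a mini-table.
html_doc = (
    '<table><tr>'
    '<td><table style="width: 100%" bgcolor="#ff0000">'
    '<tr><td><font color="#ffffff"><b>1.23</b></font></td></tr>'
    '</table></td>'
    '<td>plain cell</td>'
    '</tr></table>'
)

soup = BeautifulSoup(html_doc, "html.parser")

# "td > table" selects only tables nested directly inside a cell,
# so the outer table is not matched.
for inner in soup.select("td > table"):
    # Swap the whole mini-table for the text it contains.
    inner.replace_with(inner.get_text(strip=True))

result = str(soup)
print(result)
```

The selector is the part most likely to need adjusting; if the mini-tables carry a distinguishing attribute, selecting on that would be more robust than relying on nesting.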
Upvotes: 2