jh314

Reputation: 27812

Removing specific HTML tags with Python

I have some HTML tables inside of an HTML cell, like so:

miniTable = '''<table style="width: 100%%" bgcolor="%s">
               <tr><td><font color="%s"><b>%s</b></td></tr>
           </table>''' % (bgcolor, fontColor, floatNumber)

html += '<td>' + miniTable + '</td>'

Is there a way to remove the HTML tags that belong to this miniTable, and only those tags?
I would like to remove these tags:

<table style="width: 100%%" bgcolor="%s"><tr><td><font color="%s"><b>
and
</b></td></tr></table>

to get this:

floatNumber

where floatNumber is the string representation of a floating point number. I don't want any of the other HTML tags to be modified in any way. I was thinking of using string.replace or regex, but I'm stumped.

Upvotes: 1

Views: 7540

Answers (2)

fedosov

Reputation: 2049

If you can't install Beautiful Soup, a regex will do for this fixed markup (otherwise Beautiful Soup is preferred, as @otto-allmendinger proposed):

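import re
# Sample miniTable markup; the %s placeholders are left unformatted because only the tags matter here
s = '<table style="width: 100%%" bgcolor="%s"><tr><td><font color="%s"><b>1.23</b></td></tr></table>'
# "/?" matches opening and closing tags alike (the original ".?" matched any character)
result = float(re.sub(r"</?table[^>]*>|</?t[rd]>|</?font[^>]*>|</?b>", "", s))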
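The snippet above works on an isolated miniTable string. To replace each miniTable inside the full html string while leaving every other tag alone, one option is to match the whole template and keep only the captured number. This is a rough sketch, not a general HTML solution: the names MINI_TABLE, strip_mini_tables and sample are made up for illustration, and it assumes the markup is exactly what the template in the question produces (after formatting, "100%%" becomes "100%") and that the attribute values contain no double quotes.

```python
import re

# Pattern mirroring the miniTable template from the question. The captured
# group is the number between <b> and </b>. Assumes attribute values
# contain no double quotes.
MINI_TABLE = re.compile(
    r'<table style="width: 100%" bgcolor="[^"]*">\s*'
    r'<tr><td><font color="[^"]*"><b>([^<]*)</b></td></tr>\s*'
    r'</table>'
)


def strip_mini_tables(html):
    """Replace every miniTable in html with just its number."""
    return MINI_TABLE.sub(r"\1", html)


# Hypothetical sample row: one cell wrapping a miniTable, one unrelated cell.
sample = (
    '<tr><td><table style="width: 100%" bgcolor="#ff0000">\n'
    '               <tr><td><font color="#ffffff"><b>1.23</b></td></tr>\n'
    '           </table></td><td><b>keep</b></td></tr>'
)
cleaned = strip_mini_tables(sample)
```

Because the pattern is anchored on the full template rather than on individual tags, a `<b>` or `<td>` elsewhere in the document is not touched. If the template changes, the pattern has to change with it.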

Upvotes: 2

Otto Allmendinger

Reputation: 28288

Do not use str.replace or regex.

Use an HTML parsing library like Beautiful Soup: find the element you want and read the text it contains.

The final code should look something like this:

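from bs4 import BeautifulSoup

# html_doc is the full HTML string built in the question;
# naming the parser explicitly avoids the "no parser specified" warning
soup = BeautifulSoup(html_doc, "html.parser")

for t in soup.find_all("table"):  # the actual selection depends on your specific code
    content = t.get_text(strip=True)  # strip=True drops the whitespace around the number
    # content should be the float number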
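If Beautiful Soup is not available, roughly the same parse-then-extract idea can be sketched with the standard library's html.parser module. Treat this as an illustration only: the class InnerTableText and the sample string are invented here, and it assumes each miniTable sits exactly one level deep inside an outer table.

```python
from html.parser import HTMLParser


class InnerTableText(HTMLParser):
    """Collect the text found inside tables nested within another table."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # number of <table> elements currently open
        self.values = []     # stripped text of each nested table
        self._chunks = []    # text pieces of the nested table being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth == 2:
                self._chunks = []

    def handle_endtag(self, tag):
        if tag == "table":
            if self.depth == 2:
                self.values.append("".join(self._chunks).strip())
            self.depth -= 1

    def handle_data(self, data):
        if self.depth >= 2:
            self._chunks.append(data)


# Hypothetical sample: an outer table whose only cell wraps one miniTable.
sample = (
    '<table><tr><td><table style="width: 100%" bgcolor="#ff0000">'
    '<tr><td><font color="#ffffff"><b>1.23</b></td></tr>'
    '</table></td></tr></table>'
)
parser = InnerTableText()
parser.feed(sample)
parser.close()
numbers = [float(v) for v in parser.values]
```

This only reads the numbers out. It does not rewrite the document, so it fits the case where the values are needed rather than a cleaned-up HTML string.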

Upvotes: 2
