Louis Thibault

Reputation: 21420

Removing html image tags and everything in between from a string

I've seen a number of questions about removing HTML tags from strings, but I'm still a bit unclear on how my specific case should be handled.

I've seen that many posts advise against using regular expressions to handle HTML, but I suspect my case may warrant judicious circumvention of this rule.

I'm trying to parse PDF files and I've successfully managed to convert each page from my sample PDF file into a string of UTF-32 text. When images appear, an HTML-style tag is inserted which contains the name and location of the image (which is saved elsewhere).

In a separate portion of my app, I need to get rid of these image tags. Because we're only dealing with image tags, I suspect the use of a regex may be warranted.

My question is twofold:

  1. Should I use a regex to remove these tags, or should I still use an HTML parsing module such as BeautifulSoup?
  2. Which regex or BeautifulSoup construct should I use? In other words, how should I code this?

For clarity, the tags are structured as <img src="/path/to/file"/>

Thanks!

Upvotes: 10

Views: 10050

Answers (3)

Cubiczx

Reputation: 1135

My solution is:

import re

def remove_HTML_tag(tag, string):
    # Strip opening tags such as <img ...>, then closing tags such as </img>
    string = re.sub(r"<\b(" + tag + r")\b[^>]*>", r"", string)
    return re.sub(r"<\/\b(" + tag + r")\b[^>]*>", r"", string)

Upvotes: 0

senderle

Reputation: 151047

Since this text contains only image tags, it's probably OK to use a regex. But for anything else you're probably better off using a bona fide HTML parser. Fortunately, Python provides one! The example below is pretty bare-bones; to be fully functional, it would have to handle a lot more corner cases. Most notably, XHTML-style empty tags (ending with a slash, <... />) aren't handled correctly here.

>>> from HTMLParser import HTMLParser
>>> 
>>> class TagDropper(HTMLParser):
...     def __init__(self, tags_to_drop, *args, **kwargs):
...         HTMLParser.__init__(self, *args, **kwargs)
...         self._text = []
...         self._tags_to_drop = set(tags_to_drop)
...     def clear_text(self):
...         self._text = []
...     def get_text(self):
...         return ''.join(self._text)
...     def handle_starttag(self, tag, attrs):
...         if tag not in self._tags_to_drop:
...             self._text.append(self.get_starttag_text())
...     def handle_endtag(self, tag):
...         self._text.append('</{0}>'.format(tag))
...     def handle_data(self, data):
...         self._text.append(data)
... 
>>> td = TagDropper([])
>>> td.feed('A line of text\nA line of text with an <img url="foo"> tag\nAnother line of text with a <br> tag\n')
>>> print td.get_text()
A line of text
A line of text with an <img url="foo"> tag
Another line of text with a <br> tag

And to drop img tags...

>>> td = TagDropper(['img'])
>>> td.feed('A line of text\nA line of text with an <img url="foo"> tag\nAnother line of text with a <br> tag\n')
>>> print td.get_text()
A line of text
A line of text with an  tag
Another line of text with a <br> tag
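The session above is Python 2 (the HTMLParser module and the print statement). As a rough sketch of how the same idea might look on Python 3, where the parser lives in html.parser, the variant below also covers the two gaps mentioned: XHTML-style empty tags and end tags of dropped elements. The names TagDropper3 and drop_tags are made up for this illustration, and comments and doctype declarations are still silently discarded, so treat it as a starting point rather than a complete solution.

```python
from html.parser import HTMLParser


class TagDropper3(HTMLParser):
    """Rebuild the input while omitting every tag named in tags_to_drop."""

    def __init__(self, tags_to_drop):
        # convert_charrefs=False so entities such as &amp; are reported
        # separately and can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self._text = []
        self._tags_to_drop = set(tags_to_drop)

    def get_text(self):
        return ''.join(self._text)

    def handle_starttag(self, tag, attrs):
        if tag not in self._tags_to_drop:
            self._text.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style empty tags such as <img src="..."/>.
        if tag not in self._tags_to_drop:
            self._text.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Skip the closing tag of a dropped element as well.
        if tag not in self._tags_to_drop:
            self._text.append('</{0}>'.format(tag))

    def handle_data(self, data):
        self._text.append(data)

    def handle_entityref(self, name):
        self._text.append('&{0};'.format(name))

    def handle_charref(self, name):
        self._text.append('&#{0};'.format(name))


def drop_tags(text, tags_to_drop):
    """Hypothetical helper: feed text through a fresh parser and return the result."""
    parser = TagDropper3(tags_to_drop)
    parser.feed(text)
    parser.close()
    return parser.get_text()
```

Calling drop_tags(page_text, ['img']) should then remove tags in the form given in the question, <img src="/path/to/file"/>, while leaving other markup in place.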

Upvotes: 3

joshcartme

Reputation: 2747

I would vote that in your case it is acceptable to use a regular expression. Something like this should work:

import re

def remove_html_tags(data):
    # Non-greedy match: removes every <...> tag, not only image tags
    p = re.compile(r'<.*?>')
    return p.sub('', data)

I found that snippet here (http://love-python.blogspot.com/2008/07/strip-html-tags-using-python.html)

Edit: a version that removes only tags of the form <img .... />:

import re

def remove_img_tags(data):
    # Matches from '<img' up to the nearest '/>' on the same line
    p = re.compile(r'<img.*?/>')
    return p.sub('', data)
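If the tags might not always end in />, or might span a line break, a slightly stricter pattern could be used instead. This is only a sketch: the names IMG_TAG and strip_img_tags are invented here, and the pattern assumes no attribute value contains a literal > character.

```python
import re

# '<img', a word boundary (so '<imgur>' is left alone), any run of
# characters other than '>' (newlines included), then the closing '>'.
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)


def strip_img_tags(text):
    """Remove <img ...> and <img .../> tags, leaving all other text untouched."""
    return IMG_TAG.sub('', text)
```

Because [^>]* cannot run past the end of the tag, an image tag without the trailing slash does not swallow the text that follows it, which the lazy .*? version could do.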

Upvotes: 15
