Removing HTML tags and entities from a string in Python

I am getting XML data from api.careerbuilder.com. In particular, the string contains some HTML entities that I want to remove, but nothing I have tried has worked.

I have tried doing this:

import re
# raw string; "<" and ">" need no escaping in a regex
re.sub(r'<.*?>', '', job_title_text)

and this:

from html.parser import HTMLParser
class MLStripper(HTMLParser):
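    def __init__(self):
        # HTMLParser.__init__ sets up parser state and calls reset();
        # convert_charrefs=False keeps entities out of handle_data
        super().__init__(convert_charrefs=False)
        self.fed = []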
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

strip_tags(job_title_text)

and finally this:

import lxml.html
(lxml.html.fromstring(job_title_text)).text_content()

But all of these failed. The second approach deleted HTML entities like "&amp", but the text inside the tags was left behind, for example "pbrspan". The third one ruined everything: no data was shown at all, only this instead:

< bound method HtmlElement.text_content of < Element html at 0x33717d8> >

I also suspect that the regex I have written is entirely wrong. Any ideas on how this can be handled?

Upvotes: 1

Answers (2)

Ali SAID OMAR

Reputation: 6792

Consider using BeautifulSoup to remove the tags. It is well documented: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Removing%20elements

Upvotes: 0

arm.localhost

Reputation: 479

Try this regular expression, which matches tags that arrive entity-encoded (for example "&lt;p&gt;"):

(\&lt\;).*?(\&gt\;)
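
As a rough sketch of how that pattern might be applied, the snippet below assumes the job title arrives with its tags entity-encoded (such as "&lt;p&gt;"), which would explain the leftover "pbrspan". The function name strip_escaped_markup and the sample string are made up for illustration; html.unescape is from the standard library (Python 3.4+). Note that text containing a genuine "&lt;" followed later by "&gt;" would be mangled by this approach.

```python
import html
import re

def strip_escaped_markup(text):
    """Hypothetical helper: drop encoded and literal tags, then decode entities."""
    # Remove entity-encoded tags such as &lt;p&gt; (same idea as the pattern above).
    text = re.sub(r'&lt;.*?&gt;', '', text)
    # Remove any literal tags such as <p> that may also be present.
    text = re.sub(r'<.*?>', '', text)
    # Decode whatever entities remain, e.g. &amp; becomes &.
    return html.unescape(text).strip()

# Made-up sample input, not real API output.
sample = '&lt;p&gt;Senior &lt;span&gt;R&amp;D&lt;/span&gt; Engineer&lt;/p&gt;'
print(strip_escaped_markup(sample))  # prints: Senior R&D Engineer
```

The tags are removed before decoding so that a decoded "&amp;" or similar cannot be mistaken for markup.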

Upvotes: 1
