Removing HTML tags and entities from a string in Python

Question

I am getting XML data from api.careerbuilder.com. In particular, the string contains some HTML entities that I want to remove, but nothing I have tried has had any effect.

I have tried doing this:

import re
re.sub('\&lt;.*?\&gt;', '', job_title_text)

and this

from html.parser import HTMLParser
class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

strip_tags(job_title_text)

and finally this

import lxml.html
(lxml.html.fromstring(job_title_text)).text_content()

But all of these failed. The second approach removed HTML entities such as "&", but the tag names were left behind in the text, for example "pbrspan". The third one ruined everything: no data was shown at all, only this instead:

<bound method HtmlElement.text_content of <Element html at 0x33717d8>>

Finally, I suspect that the regex I have written is entirely wrong. Any ideas on how this can be handled?

arm.localhost · Accepted Answer

Try this regular expression:

(\&lt\;).*?(\&gt\;)
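
A minimal sketch of how such a pattern might be applied, assuming the API returns tags in escaped form (for example &lt;p&gt; rather than a literal tag), which is what the regex in the question suggests. The function name strip_escaped_tags and the sample string are hypothetical and only serve the illustration. Note that re.sub returns a new string, so the result has to be kept; calling it without using the return value leaves job_title_text unchanged.

```python
import html
import re

# Non-greedy match for one escaped tag, e.g. "&lt;p&gt;" or "&lt;/span&gt;".
# Assumption: tags arrive escaped, as the regex in the question implies.
ESCAPED_TAG = re.compile(r"&lt;.*?&gt;", re.DOTALL)


def strip_escaped_tags(text):
    """Remove escaped tags, decode remaining entities, tidy whitespace."""
    # Replace each escaped tag with a space so adjacent words do not fuse.
    without_tags = ESCAPED_TAG.sub(" ", text)
    # Decode leftover entities such as "&amp;" into plain characters.
    decoded = html.unescape(without_tags)
    # Collapse the runs of whitespace left behind by the removed tags.
    return " ".join(decoded.split())


# Hypothetical sample input, not taken from the real API response.
job_title_text = (
    "&lt;p&gt;Sales &amp; Marketing&lt;br&gt;"
    "&lt;span&gt;Manager&lt;/span&gt;&lt;/p&gt;"
)
clean_title = strip_escaped_tags(job_title_text)
print(clean_title)
```

One caveat of this sketch: a non-greedy regex cannot tell an escaped tag from a stray "&lt;" that is part of the actual text, so anything between such a "&lt;" and the next "&gt;" would also be removed. If that matters for the data, decoding the entities first and running the result through an HTML parser is likely the safer route.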
