Reputation: 107
I am getting XML data from api.careerbuilder.com. In particular, the job title string contains HTML tags and entities that I want to remove, but nothing I have tried works.
I have tried doing this:
import re
re.sub('\<.*?\>', '', job_title_text)
and this
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

strip_tags(job_title_text)
and finally this
import lxml.html
(lxml.html.fromstring(job_title_text)).text_content()
None of these worked. The second approach removed HTML entities such as "&", but the tag names were left behind as plain text, for example "pbrspan". The third approach showed no data at all; instead I got:
<bound method HtmlElement.text_content of <Element html at 0x33717d8>>
I also suspect that the regex I wrote is entirely wrong. Any ideas on how to handle this?
Upvotes: 1
Views: 3432
Reputation: 6792
Consider using BeautifulSoup to remove the tags. It is well documented: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Removing%20elements
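If installing BeautifulSoup is not an option, here is a rough standard-library sketch of the same idea. It rests on an assumption the question only hints at: leftovers like "pbrspan" suggest the tags arrive entity-escaped (for example "&lt;p&gt;"), so the entities are decoded first and the tags stripped afterwards. The names TagStripper and sample are made up for illustration, and the sample string is not real API output.

```python
import html
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # Call the base initializer so the parser state is set up properly.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def strip_tags(text):
    # Assumption: markup may be entity-escaped ("&lt;p&gt;"), so decode
    # entities first to turn it back into real tags.
    unescaped = html.unescape(text)
    stripper = TagStripper()
    stripper.feed(unescaped)
    stripper.close()  # flush any buffered trailing text
    return stripper.get_text()


# Hypothetical input, not taken from the real feed.
sample = "&lt;p&gt;Senior &lt;span&gt;Java&lt;/span&gt; Developer&lt;br&gt;&lt;/p&gt;"
print(strip_tags(sample))  # Senior Java Developer
```

Note that decoding first also turns any literally intended "&lt;" into the start of a tag, so this only fits data where escaped angle brackets really are markup.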
Upvotes: 0