Reputation: 317
I have a string in Python. I used escape() to get rid of the newlines, and now my string looks like this:
<p>Wie hoch ist der Anteil &laquo;oraler MS-Medikamente&raquo;
bei Neuverschreibungen in Ihrer Sprechstunde?&nbsp;</p>
But it's supposed to look like this:
Wie hoch ist der Anteil oraler MS-Medikamente bei Neuverschreibungen in Ihrer Sprechstunde?
What can I do?
Upvotes: 1
Views: 126
Reputation: 8270
List all the unwanted substrings in the characters list and then replace each one with an empty string:
string = '<p>Wie hoch ist der Anteil &laquo;oraler MS-Medikamente&raquo;bei Neuverschreibungen in Ihrer Sprechstunde?&nbsp;</p>'
def unescape(s):
    characters = ["<p>", "<", ">", "&", "laquo;", "raquo;", "nbsp;", "/p"]
    for character in characters:
        s = s.replace(character, "")
    return s
print(unescape(string))
Here is the result:
Wie hoch ist der Anteil oraler MS-Medikamentebei Neuverschreibungen in Ihrer Sprechstunde?
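The list above only covers the tags and entities that happen to occur in this one string, and the words run together where the original had a newline. A minimal sketch of the same delete-everything idea, assuming Python 3 and a simple fragment like the one in the question: one regex removes any tag or entity, and whitespace runs (newlines included) are collapsed to single spaces. The names MARKUP and strip_markup are made up for this illustration.

```python
import re

# Matches any tag (<p>, </p>, ...) or any entity (&laquo;, &nbsp;, &#171;, ...).
MARKUP = re.compile(r"<[^>]+>|&#?\w+;")

def strip_markup(s):
    """Delete tags and entities, then collapse whitespace runs (newlines too)."""
    without_markup = MARKUP.sub("", s)
    return " ".join(without_markup.split())

raw = ("<p>Wie hoch ist der Anteil &laquo;oraler MS-Medikamente&raquo;\n"
       "bei Neuverschreibungen in Ihrer Sprechstunde?&nbsp;</p>")
print(strip_markup(raw))
```

Note that entities are deleted here, not decoded, so an &amp; in the input would disappear instead of becoming "&".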
Upvotes: 0
Reputation:
Try decoding the string (the reverse of escaping).
HTML Encoder / Decoder - Converts characters to their corresponding HTML Entities - Web 2.0 Generators http://goo.gl/2tcml1
You could also use BeautifulSoup:
# Requires the third-party beautifulsoup4 package
from bs4 import BeautifulSoup
soup = BeautifulSoup(raw_html, "html.parser")
cleantext = soup.get_text()
Upvotes: 1
Reputation: 95958
You can unescape the string to get the HTML tags back:
import html
import re
# html.unescape replaces HTMLParser().unescape, which was removed in Python 3.9
text = html.unescape(text)
and then use a regex to remove the HTML tags:
p = re.compile(r'<.*?>')
text = p.sub('', text)
I don't really recommend using regexes for parsing HTML; you can use BeautifulSoup instead.
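If installing BeautifulSoup is not an option, the standard library's html.parser can do the tag stripping without a regex. This is only a sketch, assuming Python 3; TextExtractor and html_to_text are hypothetical names, not part of any library. The parser decodes entities itself (convert_charrefs is on by default), so &laquo; and &raquo; come out as real guillemets instead of being dropped.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes; tags are skipped and entities are decoded by the parser."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True is the default
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(fragment):
    extractor = TextExtractor()
    extractor.feed(fragment)
    extractor.close()  # flush any buffered text
    text = "".join(extractor.chunks)
    # split() with no arguments also splits on newlines and on \xa0 (from &nbsp;)
    return " ".join(text.split())

raw = ("<p>Wie hoch ist der Anteil &laquo;oraler MS-Medikamente&raquo;\n"
       "bei Neuverschreibungen in Ihrer Sprechstunde?&nbsp;</p>")
print(html_to_text(raw))
```

To match the output in the question exactly, the guillemets would still have to be removed afterwards, for example with text.replace("«", "").replace("»", "").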
Upvotes: 0