Reputation: 317

Python - Advanced string escaping

I have a string in Python. I used escape() to get rid of the newlines, and now my string looks like this:

&lt;p&gt;Wie hoch ist der Anteil &amp;laquo;oraler MS-Medikamente&amp;raquo;
bei Neuverschreibungen in Ihrer Sprechstunde?&amp;nbsp;&lt;/p&gt;

But it's supposed to look like this:

Wie hoch ist der Anteil oraler MS-Medikamente bei Neuverschreibungen in Ihrer Sprechstunde?

What can I do?

Upvotes: 1

Answers (3)

Alderven

Reputation: 8280

List every unwanted token in the characters list and replace each one with an empty string:

# A space separates the two lines of the original text, so the words do not run together.
string = '&lt;p&gt;Wie hoch ist der Anteil &amp;laquo;oraler MS-Medikamente&amp;raquo; bei Neuverschreibungen in Ihrer Sprechstunde?&amp;nbsp;&lt;/p&gt;'

def unescape(s):
    # Order matters: longer tokens such as "&lt;p&gt;" must come before their parts.
    characters = ["&lt;p&gt;", "&lt;", "&gt;", "&amp;", "laquo;", "raquo;", "nbsp;", "/p"]
    for character in characters:
        s = s.replace(character, "")
    return s

print(unescape(string))

Here is the result:

Wie hoch ist der Anteil oraler MS-Medikamente bei Neuverschreibungen in Ihrer Sprechstunde?

Upvotes: 0

user1630938

Reputation:

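Try decoding the string (the reverse of escaping). An online HTML encoder/decoder that converts characters to their corresponding HTML entities is available at http://goo.gl/2tcml1
You could also use BeautifulSoup:

from bs4 import BeautifulSoup  # provided by the beautifulsoup4 package

# raw_html is the string containing the HTML markup
soup = BeautifulSoup(raw_html, "html.parser")
cleantext = soup.get_text()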
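
The decoding step suggested above can be sketched with the standard library alone. This assumes the sample from the question is escaped twice (`&amp;laquo;` becomes `&laquo;` and only then `«`), which is why a single pass is not enough. The helper name `unescape_fully` and its `max_rounds` parameter are hypothetical and chosen for illustration; the input is a shortened version of the question's string.

```python
import html


def unescape_fully(s, max_rounds=5):
    # Hypothetical helper: unescape repeatedly until the text stops changing.
    for _ in range(max_rounds):
        decoded = html.unescape(s)
        if decoded == s:
            break
        s = decoded
    return s


raw = "&lt;p&gt;Anteil &amp;laquo;oraler MS-Medikamente&amp;raquo;&amp;nbsp;&lt;/p&gt;"
print(html.unescape(raw))   # one pass: tags restored, entity names still escaped
print(unescape_fully(raw))  # second pass turns the entities into real characters
```

Repeated unescaping is only safe when the input is known to be over-escaped; text that legitimately contains a literal `&amp;lt;` would be decoded too far.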

Upvotes: 1

Maroun

Reputation: 96016

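You can unescape the string in order to get the HTML tags back (on Python 3, html.unescape replaces the Python 2 HTMLParser().unescape method):

import html
text = html.unescape(text)  # text is the escaped input string

and then use a regex to remove the HTML tags:

import re
p = re.compile(r'<.*?>')  # non-greedy match of anything between < and >
text = p.sub('', text)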

I don't really recommend using regexes for parsing HTML; you can use BeautifulSoup instead.
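
As a dependency-free alternative to both the regex and BeautifulSoup, here is a minimal sketch that uses the standard library's `html.parser` to collect only the text nodes. It is not the approach from the answer above, just one way to avoid regex-based tag stripping. The names `TextExtractor` and `clean` are hypothetical, and the code assumes the input is escaped twice, as in the question.

```python
import html
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper that keeps text nodes and ignores tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &laquo; in text nodes.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean(raw):
    # Assumption: the input is escaped twice, so this first pass only
    # restores the tags and the entity names (&amp;laquo; -> &laquo;).
    markup = html.unescape(raw)
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()  # flush any buffered trailing text
    text = "".join(extractor.parts)
    # Drop the guillemets, which the desired output does not contain.
    text = text.replace("\u00ab", "").replace("\u00bb", "")
    # Collapse newlines and non-breaking spaces into single spaces.
    return " ".join(text.split())


raw = (
    "&lt;p&gt;Wie hoch ist der Anteil &amp;laquo;oraler MS-Medikamente&amp;raquo;\n"
    "bei Neuverschreibungen in Ihrer Sprechstunde?&amp;nbsp;&lt;/p&gt;"
)
print(clean(raw))
```

Splitting on whitespace and rejoining also removes the trailing non-breaking space left behind by `&nbsp;`, because `str.split()` treats it as whitespace.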

Upvotes: 0
