Dominic

Reputation: 317

Python - Advanced string escaping

I have a string in Python. I used escape() to get rid of the newlines, and now my string looks like this:

&lt;p&gt;Wie hoch ist der Anteil &laquo;oraler MS-Medikamente&raquo;&nbsp;bei Neuverschreibungen in Ihrer Sprechstunde? &lt;/p&gt;

But it's supposed to look like this:

Wie hoch ist der Anteil oraler MS-Medikamente bei Neuverschreibungen in Ihrer Sprechstunde?

What can I do?

Upvotes: 1

Views: 126

Answers (3)

Alderven

Reputation: 8270

List all unnecessary symbols in the characters list and then replace them:

string = '&lt;p&gt;Wie hoch ist der Anteil &laquo;oraler MS-Medikamente&raquo;&nbsp;bei Neuverschreibungen in Ihrer Sprechstunde? &lt;/p&gt;'

def unescape(s):
    # Order matters: "&lt;p&gt;" must be removed before its fragments "&lt;" and "&gt;".
    characters = ["&lt;p&gt;", "&lt;", "&gt;", "&", "laquo;", "raquo;", "nbsp;", "/p"]
    for character in characters:
        s = s.replace(character, "")
    return s

print(unescape(string))

Here is the result:

Wie hoch ist der Anteil oraler MS-Medikamentebei Neuverschreibungen in Ihrer Sprechstunde?

Upvotes: 0

user1630938

Reputation:

  1. Try to decode (reverse escape).
    HTML Encoder / Decoder - Converts characters to their corresponding HTML Entities - Web 2.0 Generators http://goo.gl/2tcml1

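  2. You could also use BeautifulSoup to extract the text:

from bs4 import BeautifulSoup  # requires the beautifulsoup4 package

# raw_html is the unescaped HTML string
soup = BeautifulSoup(raw_html, "html.parser")
cleantext = soup.get_text()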
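
For point 1, the decoding can also be done in Python itself rather than with an online tool. The following is only a minimal sketch, assuming Python 3 and assuming the string was escaped exactly once; the sample string and the variable names `escaped` and `decoded` are illustrative and not taken from the question.

```python
import html

# Hypothetical sample: an HTML fragment that was escaped once.
escaped = (
    "&lt;p&gt;Wie hoch ist der Anteil &laquo;oraler MS-Medikamente&raquo;"
    "&nbsp;bei Neuverschreibungen?&lt;/p&gt;"
)

# html.unescape() turns entities such as &lt; and &laquo; back into characters.
decoded = html.unescape(escaped)

# &nbsp; decodes to U+00A0 (non-breaking space); swap it for a plain space.
decoded = decoded.replace("\u00a0", " ")

print(decoded)
```

After this step the tags are real tags again, so the result can be handed to an HTML parser such as BeautifulSoup to pull out the text.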

Upvotes: 1

Maroun

Reputation: 95958

You can unescape the string in order to get HTML tags back:

import html  # Python 3; on Python 2 use HTMLParser.HTMLParser().unescape() instead

# text holds the escaped string
text = html.unescape(text)

and then use a regex to remove the HTML tags:

import re

p = re.compile(r'<.*?>')
text = p.sub('', text)

I don't really recommend using regexes for parsing HTML; you can use BeautifulSoup instead.
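
If installing BeautifulSoup is not an option, the same unescape-then-strip idea can probably be done without a regex by using the standard library's `html.parser`. This is a rough sketch, assuming Python 3; `TagStripper`, `strip_tags` and the `drop` parameter are hypothetical names made up for this example, and removing the guillemets is an assumption based on the desired output in the question.

```python
import html
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text nodes; tags themselves never reach this method.
        self.parts.append(data)


def strip_tags(escaped, drop=("\u00ab", "\u00bb")):
    # Step 1: undo the escaping so the tags are real tags again.
    markup = html.unescape(escaped)
    # Step 2: let the parser separate text from tags.
    stripper = TagStripper()
    stripper.feed(markup)
    stripper.close()
    text = "".join(stripper.parts)
    # Step 3: drop the guillemets (assumed unwanted) and collapse all
    # whitespace, including the non-breaking space U+00A0, to single spaces.
    for ch in drop:
        text = text.replace(ch, "")
    return " ".join(text.split())


print(strip_tags(
    "&lt;p&gt;Wie hoch ist der Anteil &laquo;oraler MS-Medikamente&raquo;"
    "&nbsp;bei Neuverschreibungen in Ihrer Sprechstunde? &lt;/p&gt;"
))
```

Collapsing whitespace at the end keeps the words on either side of the non-breaking space separated instead of gluing them together.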

Upvotes: 0
