Reputation: 9702
I have this script, which reads the text from web page:
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page);
paragraphs = soup.findAll('p');
for p in paragraphs:
content = content+p.text+" ";
In the web page I have this string:
Möddinghofe
My script reads it as:
Möddinghofe
How can I read it as it is?
Upvotes: 1
Views: 986
Reputation: 11645
I suggest you take a look at the encoding section of the BeautifulSoup documentation.
Upvotes: 0
Reputation: 9622
Hope this would help you
from BeautifulSoup import BeautifulStoneSoup
import cgi
def HTMLEntitiesToUnicode(text):
"""Converts HTML entities to unicode. For example '&' becomes '&'."""
text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
return text
def unicodeToHTMLEntities(text):
"""Converts unicode to HTML entities. For example '&' becomes '&'."""
text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
return text
text = "&, ®, <, >, ¢, £, ¥, €, §, ©"
uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)
print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &, ®, <, >, ¢, £, ¥, €, §, ©
reference:Convert HTML entities to Unicode and vice versa
Upvotes: 1