Reputation: 1235
I am reading the content of a web page and checking it for a word that contains an umlaut. The word is present in the page content, but Python's find('ü')
method does not find it.
import urllib2
opener = urllib2.build_opener()
page_content = opener.open(url).read()
page_content.find('ü')
I also tried making the search string a unicode literal, u'ü'. That produces the error
'SyntaxError: (unicode error) 'utf8' codec can't decode byte 0xfc in position 0'
I have put # -*- coding: utf-8 -*- at the top of my .py file.
When I print page_content, the umlaut ü shows up as 'Ã¼'. If I search with page_content.find('Ã¼') instead, it works. Is there a better solution for this?
I would greatly appreciate any suggestions.
Upvotes: 2
Views: 1279
Reputation: 363507
Your Python tries to parse the source file (or console input) as UTF-8, but it's actually encoded in Latin-1. You could try to put a
# coding: iso-8859-1
comment at the top of the source file, or better, use an editor/terminal emulator that supports UTF-8 and save your scripts in that encoding.
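To see why the parser complains about byte 0xfc, here is a minimal sketch (not from the original answer; the variable names are made up for illustration). It uses escape sequences only, so it should behave the same however the script itself is saved.

```python
# Illustrative sketch; all variable names here are hypothetical.
latin1_byte = b'\xfc'  # how Latin-1 stores the u-umlaut

# Decoding the byte as Latin-1 yields U+00FC (u with diaeresis).
as_latin1 = latin1_byte.decode('iso-8859-1')

# Decoding the same byte as UTF-8 fails, which is what the SyntaxError reports.
try:
    latin1_byte.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# UTF-8 stores the same character as two bytes. Read back as Latin-1,
# they appear as two separate characters (the garbling seen when printing the page).
utf8_bytes = u'\xfc'.encode('utf-8')
mojibake = utf8_bytes.decode('iso-8859-1')

print(repr(as_latin1), utf8_ok, repr(utf8_bytes), repr(mojibake))
```

The last two lines also suggest why searching for the two-character garbled form happened to work: it matches the page's UTF-8 bytes one for one.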
Upvotes: 2
Reputation: 11163
If you declare the UTF-8 encoding at the top of the file as follows, things should work, provided the file is actually saved as UTF-8. Note that the coding
line must be the first line, or the second line if the first one is the hashbang.
#!/usr/bin/python
# coding: utf-8
import urllib2
url = 'http://en.wikipedia.org/wiki/Germanic_umlaut'
opener = urllib2.build_opener()
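# Decode the raw bytes first (assuming the page is served as UTF-8),
# so that a unicode string is searched with a unicode literal.
page_content = opener.open(url).read().decode('utf-8')
page_content.find(u'ü')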
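One way to avoid mixing bytes and text is to decode the downloaded bytes before searching. The sketch below is only an illustration under stated assumptions: page_bytes is a hypothetical stand-in for what opener.open(url).read() returns, and the page is assumed to be UTF-8 (in practice the charset should come from the Content-Type header or the page's meta tag).

```python
# Hypothetical stand-in for the bytes returned by opener.open(url).read().
page_bytes = u'Germanic umlaut: \xfc'.encode('utf-8')

# Assumption: the page is UTF-8 encoded.
page_text = page_bytes.decode('utf-8')

needle = u'\xfc'  # the u-umlaut, written as an escape so the source encoding does not matter

# Text searched with text: returns a character offset, or -1 if absent.
text_index = page_text.find(needle)

# Alternative: stay in bytes and encode the needle with the page's encoding.
byte_index = page_bytes.find(needle.encode('utf-8'))

print(text_index, byte_index)
```

Either approach works as long as both sides of find() are the same kind of string; decoding once up front tends to be less error-prone when more text processing follows.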
Upvotes: 0