Pradeeshnarayan

Reputation: 1235

Python URL encoding with umlauts error

I am reading a web page's content and checking for a word that contains an umlaut. The word is present in the page content, but Python's find('ü') does not find it.

import urllib2
opener = urllib2.build_opener()
page_content = opener.open(url).read() 
page_content.find('ü')

I tried making the search string a unicode literal, u'ü'. That raises this error:

'SyntaxError: (unicode error) 'utf8' codec can't decode byte 0xfc in position 0'

I have put # -*- coding: utf-8 -*- at the top of my .py file.

I have printed page_content. There the umlaut ü is converted to 'ü'. If I search with page_content.find('ü'), it works fine. Please let me know if there is a better solution for this.

I would greatly appreciate any suggestions.

Upvotes: 2

Views: 1279

Answers (2)

Fred Foo

Reputation: 363507

Your Python tries to parse the source file (or console input) as UTF-8, but it's actually encoded in Latin-1. You could try to put a

# coding: iso-8859-1

comment at the top of the source file, or better, use an editor/terminal emulator that supports UTF-8 and save your scripts in that encoding.
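
The mismatch can be seen directly at the byte level. The following is a minimal sketch, not part of the original script: it assumes only that the offending byte is the 0xfc from the error message, and the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Sketch: why byte 0xfc breaks a file that is declared as UTF-8.
# The names below are illustrative, not taken from the original script.

latin1_bytes = b'\xfc'     # how Latin-1 (ISO-8859-1) stores u-umlaut
utf8_bytes = b'\xc3\xbc'   # how UTF-8 stores the same letter

# Both byte sequences decode to the same character, U+00FC.
same_char = latin1_bytes.decode('iso-8859-1') == utf8_bytes.decode('utf-8')

# Decoding the lone Latin-1 byte as UTF-8 fails, because 0xfc is not a
# valid UTF-8 start byte. This is the failure the parser reported.
try:
    latin1_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print('same character: %s' % same_char)
print('UTF-8 decode failed: %s' % decode_failed)
```

So the coding comment and the encoding the editor actually saves in have to agree; either one can be changed to match the other.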

Upvotes: 2

Maria Zverina

Reputation: 11163

If you declare UTF-8 encoding at the top of the file as follows, things should work, provided the file is actually saved as UTF-8. Please note that the coding line must be either the first line or the second line (after the hashbang).

#!/usr/bin/python
# coding: utf-8

import urllib2

url = 'http://en.wikipedia.org/wiki/Germanic_umlaut'
opener = urllib2.build_opener()
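# read() returns bytes; decode them (Wikipedia serves UTF-8) so that
# find() compares text with text instead of bytes with text
page_content = opener.open(url).read().decode('utf-8')
page_content.find(u'ü')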
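
One detail worth spelling out: read() returns raw bytes, so the search is most predictable when the page is decoded to text before calling find(). Below is a hedged, offline sketch of that idea. The helper find_text and the sample bytes are hypothetical, and the encoding is assumed to be UTF-8; for a real page it should come from the response's Content-Type header or the page's meta tag.

```python
# -*- coding: utf-8 -*-
# Sketch: decode downloaded bytes before searching for non-ASCII text.
# find_text and sample_page are hypothetical names used for illustration.


def find_text(raw_bytes, needle, encoding='utf-8'):
    """Return the character index of needle in the decoded page, or -1."""
    text = raw_bytes.decode(encoding)
    return text.find(needle)


# Stand-in for opener.open(url).read(): the UTF-8 bytes of a small
# HTML fragment containing u-umlaut (bytes 0xc3 0xbc).
sample_page = b'<p>M\xc3\xbcnchen</p>'

index = find_text(sample_page, u'\u00fc')
print(index)
```

For a page served as Latin-1, the same helper can be called with encoding='iso-8859-1'.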

Upvotes: 0
