Reputation: 113
I'm using Python 3.3.0 on Windows 8.
import urllib.request

requrl = urllib.request.Request(url)
response = urllib.request.urlopen(requrl)
source = response.read()           # bytes
source = source.decode('utf-8')    # str, assuming the page is UTF-8
This works fine if the website uses the utf-8 charset, but what if it uses iso-8859-1 or any other charset? In other words, I may have different website URLs with different charsets.
So, how do I deal with multiple charsets?
Now let me tell you what I tried to resolve this issue:
b1 = b'charset=iso-8859-1'
b1 = b1.decode('iso-8859-1')
if b1 in source:
source = source.decode('iso-8859-1')
It gave me an error: TypeError: Type str doesn't support the buffer API.
So, I'm assuming that it's treating b1 as a string, and that this is not the correct way! :(
Please don't tell me to manually change the charset in the source code, or ask whether I have read the Python docs! I have already tried to dig into the Python 3 docs but still have no luck, or maybe I'm not picking the right modules/sections to read.
Upvotes: 2
Views: 3629
Reputation: 31651
In Python 3, a str is actually a sequence of Unicode characters (equivalent to the u'mystring' syntax in Python 2). What you get back from response.read() is a byte string (a sequence of bytes).
The reason your b1 in source check fails is that you are trying to find a Unicode character sequence inside a byte string. This makes no sense, so it fails. If you take out the line b1 = b1.decode('iso-8859-1'), it should work, because you are then comparing two byte sequences.
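A minimal sketch of the distinction (the source value here is made up for illustration):

source = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
type(source)                       # <class 'bytes'> -- what response.read() gives you
b'charset=iso-8859-1' in source    # True: bytes compared with bytes
text = source.decode('iso-8859-1')
type(text)                         # <class 'str'> -- a sequence of Unicode characters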
Now back to your real underlying issue. To support multiple charsets, you need to determine the character set so you can decode the page to a Unicode string. This is tricky to do. Normally you can examine the Content-Type header of the response (see the rules below). However, so many websites declare the wrong encoding in the header that other, more complicated encoding-sniffing rules had to be developed for HTML. Please read up on that encoding-sniffing algorithm so you realize what a difficult problem this is!
I recommend you either:
- use a library that parses the HTML for you (such as lxml or html5lib) and let it deal with determining the encoding; these libraries often implement the right charset-sniffing algorithms for the document type (see the sketch below), or
- if that doesn't work, get more aggressive and use a library like chardet to detect the encoding. But in my experience, people who serve their web pages this incorrectly are so incompetent that they produce mixed-encoding documents, so you will end up with garbage characters no matter what you do!
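For instance, with lxml you can hand the parser the raw bytes and let it pick up the declared encoding itself (a minimal sketch; error handling omitted, and url is assumed to be defined as in the question):

import urllib.request
import lxml.html

raw = urllib.request.urlopen(url).read()   # bytes, whatever the charset
doc = lxml.html.fromstring(raw)            # lxml reads the encoding declared in the bytes
title = doc.findtext('.//title')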
Here are the rules for interpreting the charset declared in a Content-Type header.
(Note that the HTML5 spec willfully violates the W3C specs by looking for UTF-8 and UTF-16 byte order marks in preference to the Content-Type header. Please read that encoding detection algorithm and see why we can't have nice things...)
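If you want to try the Content-Type route yourself, the response object from urllib already parses the header for you. A sketch (the UTF-8 fallback when no charset is declared is my assumption, not something the header guarantees):

import urllib.request

response = urllib.request.urlopen(url)
charset = response.headers.get_content_charset()   # charset from the Content-Type header, or None
source = response.read().decode(charset or 'utf-8', errors='replace')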
Upvotes: 5
Reputation: 20738
Have a look at the HTML standard, Parsing HTML documents, Determine character set (HTML5 is sufficient for our purposes).
There is an algorithm to follow; for your purposes it boils down to the following:
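Very roughly, the priority order is: a byte order mark wins, then the charset from the transport layer (the Content-Type header), then a meta declaration prescanned from the first bytes of the document, then a fallback. A simplified sketch of that order (sniff_encoding is a made-up helper, and this skips several steps of the real algorithm):

import codecs
import re

def sniff_encoding(raw, content_type_charset=None):
    # 1. A byte order mark takes precedence over everything else.
    if raw.startswith(codecs.BOM_UTF8):
        return 'utf-8'
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'
    # 2. Charset from the transport layer (the Content-Type header), if any.
    if content_type_charset:
        return content_type_charset
    # 3. Prescan the first 1024 bytes for a meta charset declaration.
    match = re.search(br'charset=["\']?([\w-]+)', raw[:1024])
    if match:
        return match.group(1).decode('ascii', 'replace')
    # 4. Otherwise fall back to a default.
    return 'utf-8'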
Upvotes: 1
Reputation: 69092
The big problem here is that in many cases you can't be sure about the encoding of a webpage, even if it declares a charset. I've seen enough pages declaring one charset but actually being in another, or having a different charset in their Content-Type header than in their meta tag or XML declaration.
In such cases chardet can be helpful.
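For example (a sketch; chardet is a third-party package and its guess is only a best effort, so decoding errors are replaced rather than raised):

import urllib.request
import chardet

raw = urllib.request.urlopen(url).read()
guess = chardet.detect(raw)     # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}
text = raw.decode(guess['encoding'] or 'utf-8', errors='replace')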
Upvotes: 2
Reputation: 320019
You're checking whether a str object is contained within a bytes object:
>>> 'df' in b'df'
Traceback (most recent call last):
File "<pyshell#107>", line 1, in <module>
'df' in b'df'
TypeError: Type str doesn't support the buffer API
So, yes, it considers b1 a str, because you've decoded the bytes object into a str object with a certain encoding. Instead, you should check against the original value of b1. It's not clear why you call .decode on it at all.
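Concretely, the check from the question works once both sides stay bytes (a sketch; source is assumed to still be the raw bytes from response.read()):

b1 = b'charset=iso-8859-1'     # keep it as bytes; no .decode()
if b1 in source:               # bytes in bytes -- no TypeError
    source = source.decode('iso-8859-1')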
Upvotes: 1