magneto

Reputation: 113

Dealing with multiple charsets in Python 3

I'm using Python 3.3.0 on Windows 8.

    import urllib.request

    requrl = urllib.request.Request(url)
    response = urllib.request.urlopen(requrl)
    source = response.read()          # bytes
    source = source.decode('utf-8')   # fails if the page isn't utf-8

This works fine if the website uses the utf-8 charset, but what if it uses iso-8859-1 or any other charset? I may have different website URLs with different charsets. So, how do I deal with multiple charsets?

Now let me show you what I tried to resolve this issue:

    b1 = b'charset=iso-8859-1'
    b1 = b1.decode('iso-8859-1')

    if b1 in source:
            source = source.decode('iso-8859-1')

It gave me an error: TypeError: Type str doesn't support the buffer API. So I'm assuming that it's treating b1 as a string, and that this is not the correct way! :(

Please don't tell me to manually change the charset in the source code, or ask whether I've read the Python docs! I have already tried to dig into the Python 3 docs, but still have no luck, or maybe I'm not picking the right modules/pages to read.

Upvotes: 2

Views: 3629

Answers (4)

Francis Avila

Reputation: 31651

In Python 3, a str is actually a sequence of unicode characters (equivalent to u'mystring' syntax in Python 2). What you get back from response.read() is a byte string (a sequence of bytes).

The reason your b1 in source check fails is that you are trying to find a unicode character sequence inside a byte string. This makes no sense, so it fails. If you take out the b1 = b1.decode('iso-8859-1') line, it should work, because you are then comparing two byte sequences.
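For illustration (the source value below is just a stand-in for the bytes returned by response.read()):

    source = b'<meta charset=iso-8859-1>'   # stand-in for response.read()

    b1 = b'charset=iso-8859-1'
    print(type(source), type(b1))           # both <class 'bytes'>
    print(b1 in source)                     # True: bytes searched inside bytes

    # b1.decode('iso-8859-1') would turn b1 into a str, and looking for
    # a str inside bytes is what raises the TypeError in Python 3.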

Now back to your real underlying issue. To support multiple charsets, you need to determine the character set so you can decode it to a Unicode string. This is tricky to do. Normally you can examine the Content-Type header of the response. (See the rules below.) However, so many websites declare the wrong encoding in the header that we have had to develop other complicated encoding sniffing rules for html. Please read that link so you realize what a difficult problem this is!
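For example, a minimal sketch of reading the declared charset from the header with urllib (the example URL and the utf-8 fallback are just placeholders):

    import urllib.request

    url = 'http://example.com/'               # stand-in URL
    response = urllib.request.urlopen(url)
    source = response.read()

    # response.headers behaves like an email.message.Message in Python 3;
    # get_content_charset() returns the charset parameter of Content-Type, or None.
    charset = response.headers.get_content_charset() or 'utf-8'   # fallback is a choice for this sketch
    text = source.decode(charset)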

I recommend you either:

  1. Use the requests library instead of urllib, because it automatically takes care of most unicode conversions properly (it's also much easier to use; see the sketch after this list). If conversion to unicode at this layer fails:
  2. Try to pass the bytes directly to an underlying library you are using (e.g. lxml or html5lib) and let them deal with determining the encoding. They often implement the right charset-sniffing algorithms for the document type.
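A quick sketch of option 1 with requests (the URL is a placeholder):

    import requests

    url = 'http://example.com/'    # stand-in URL
    r = requests.get(url)

    # r.encoding is taken from the Content-Type header when the server sends one;
    # r.text is the body decoded with that encoding, r.content is the raw bytes.
    print(r.encoding)
    text = r.text
    raw = r.content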

If neither of these work, you can get more aggressive and use libraries like chardet to detect the encoding, but in my experience people who serve their web pages this incorrectly are so incompetent that they produce mixed-encoding documents, so you will end up with garbage characters no matter what you do!
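If you do reach for chardet, a minimal sketch (the URL, the utf-8 fallback, and errors='replace' are just choices for this example):

    import urllib.request
    import chardet   # third-party: pip install chardet

    raw = urllib.request.urlopen('http://example.com/').read()
    guess = chardet.detect(raw)      # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
    text = raw.decode(guess['encoding'] or 'utf-8', errors='replace')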

Here are the rules for interpreting the charset declared in a content-type header.

  1. With no explicit charset declared:
    1. text/* (e.g., text/html) is in ASCII.
    2. application/* (e.g. application/json, application/xhtml+xml) is utf-8.
  2. With an explicit charset declared:
    1. if type is text/html and charset is iso-8859-1, it's actually win-1252 (==CP1252)
    2. otherwise use the charset declared.

(Note that the html5 spec willfully violates the w3c specs by looking for UTF8 and UTF16 byte markers in preference to the Content-Type header. Please read that encoding detection algorithm link and see why we can't have nice things...)
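As an illustration, here is a small sketch applying those rules to a raw Content-Type header value (the helper name is made up for this example):

    def charset_from_content_type(content_type):
        """Apply the rules above to a raw Content-Type header value."""
        mimetype, _, params = content_type.partition(';')
        mimetype = mimetype.strip().lower()

        charset = None
        for param in params.split(';'):
            name, _, value = param.partition('=')
            if name.strip().lower() == 'charset':
                charset = value.strip().strip('"').lower()

        if charset is None:
            if mimetype.startswith('text/'):
                return 'ascii'            # rule 1.1
            return 'utf-8'                # rule 1.2 (application/*)
        if mimetype == 'text/html' and charset == 'iso-8859-1':
            return 'windows-1252'         # rule 2.1
        return charset                    # rule 2.2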

Upvotes: 5

Jonas Schäfer

Reputation: 20738

Have a look at the HTML standard, Parsing HTML documents, Determine character set (HTML5 is sufficient for our purposes).

There is an algorithm to follow. For your purpose it boils down to the following (a rough sketch follows the list):

  1. Check for identifying sequences for UTF-16 or UTF-8 (see provided link)
  2. Use the character set supplied by HTTP (via the Content-Type header)
  3. Apply the algorithm described a little later in Prescan a byte-stream to determine its encoding. This is basically searching for "charset=" in the document and extracting the value.
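A very rough sketch of these steps (the function name is made up, and the real prescan algorithm in the spec is considerably more involved):

    import re

    def sniff_encoding(raw, http_charset=None):
        # Step 1: a byte-order mark identifies UTF-8 or UTF-16 directly.
        if raw.startswith(b'\xef\xbb\xbf'):
            return 'utf-8-sig'
        if raw.startswith(b'\xff\xfe') or raw.startswith(b'\xfe\xff'):
            return 'utf-16'
        # Step 2: use the charset from the HTTP Content-Type header if present.
        if http_charset:
            return http_charset
        # Step 3 (very rough prescan): look for charset=... near the start.
        m = re.search(br'charset\s*=\s*["\']?([\w-]+)', raw[:1024])
        if m:
            return m.group(1).decode('ascii', errors='replace')
        return 'utf-8'   # fallback; the real algorithm has more steps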

Upvotes: 1

mata

Reputation: 69092

The big problem here is that in many cases you can't be sure about the encoding of a webpage, even if it declares a charset. I've seen enough pages declaring one charset but actually being in another, or having a different charset in their Content-Type header than in their meta tag or XML declaration.

In such cases chardet can be helpful.

Upvotes: 2

SilentGhost

Reputation: 320019

You're checking whether a str is contained within a bytes object:

>>> 'df' in b'df'
Traceback (most recent call last):
  File "<pyshell#107>", line 1, in <module>
    'df' in b'df'
TypeError: Type str doesn't support the buffer API

So, yes, it considers b1 a str, because you've decoded the bytes object into a str object with a certain encoding. Instead, you should check against the original value of b1. It's not clear why you call .decode on it at all.
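In other words, the snippet from the question works once b1 is left as bytes (the source value below is just a stand-in for the bytes read from the response):

    source = b'<html><meta charset=iso-8859-1>...</html>'   # stand-in for response.read()
    b1 = b'charset=iso-8859-1'                              # left as bytes, not decoded

    if b1 in source:                                        # bytes in bytes: no TypeError
        source = source.decode('iso-8859-1')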

Upvotes: 1
