Reputation: 57
As per this code:
# coding=utf-8
import sys
import chardet
print(sys.getdefaultencoding())
a = 'abc'
print(type(a))
print(chardet.detect(a))
b = a.decode('ascii')
print(type(b))
c = '中文'
print(type(c))
print(chardet.detect(c))
m = b.encode('utf-8')
print(type(m))
print(chardet.detect(m))
n = u'abc'
print(type(n))
x = n.encode(encoding='utf-8')
print(type(x))
print(chardet.detect(x))
I use utf-8 to encode n, but chardet still reports the encoding as ascii. So I want to know: what is the relation between utf-8, ascii and unicode? I am running Python 2.
===================result=================================
=======================end result =============================
Upvotes: 0
Views: 2769
Reputation: 308530
It's because the designers of both Unicode and UTF-8 were brilliant and managed to achieve an impressive feat of backwards compatibility.
It started with the Latin-1 character set, which defined 256 characters, the first 128 of which were taken directly from ASCII. Each of these characters fit into a single byte.
Unicode built an expanded character set, and it started by stating that the first 256 codepoints would be the characters from Latin-1. This meant that the first 128 codepoints retained the same numeric value they had in ASCII.
Then came UTF-8, which used a variable-length encoding. Characters that took more than a single byte were signified by having the upper bit of each byte set, which meant that any byte with its upper bit clear was a single-byte character. Since ASCII also has the upper bit clear, the encoding of those characters is identical between ASCII and UTF-8!
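A minimal sketch makes that rule visible (Python 3 syntax here, since iterating a bytes object yields integers; the byte values themselves are the same under Python 2):
# 'A' (U+0041) is below 128, so ASCII and UTF-8 produce the identical single byte
print('A'.encode('ascii'))    # b'A'
print('A'.encode('utf-8'))    # b'A'
# '中' (U+4E2D) needs three bytes in UTF-8, and every one of them has its upper bit set
for byte in '中'.encode('utf-8'):
    print(hex(byte), bool(byte & 0x80))    # 0xe4 True, 0xb8 True, 0xad True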
Upvotes: 0
Reputation: 4298
UTF-8 encoding is such that characters 0-127 (Unicode code points U+0000 to U+007F) are the corresponding ASCII characters and are encoded the same way. chardet.detect therefore naturally reports a string containing only those characters as ascii encoded, since in effect it is...
The u'...' notation in Python 3 is there only for backwards compatibility and is the same as normal string notation, so u'abc' is the same as 'abc'. (In Python 2, which the question uses, u'abc' is a unicode object while 'abc' is a byte string; but once you encode the unicode string to UTF-8, an ASCII-only result is still detected as ascii.)
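A small sketch of that behaviour (Python 3 syntax with chardet installed; the exact confidence values and keys in the result dict vary between chardet versions, and the longer Chinese sample is just an illustrative string of my own):
import chardet

# bytes containing only code points below 128 are valid ASCII, so that is what gets reported
print(chardet.detect(b'abc'))    # {'encoding': 'ascii', 'confidence': 1.0, ...}

# once multi-byte characters appear, only UTF-8 explains the byte pattern
sample = '中文是一种语言 abc'.encode('utf-8')
print(chardet.detect(sample))    # typically {'encoding': 'utf-8', ...}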
Upvotes: 0
Reputation: 379
UTF-8 is actually a variable-width encoding, and it just so happens that ASCII characters map directly onto the same bytes in UTF-8.
Since your UTF-8 string contains only ASCII characters, the string is, strictly speaking, both an ASCII and a UTF-8 string.
This visual might help:
>>> c = '中文abc中文'
>>> c
'中文abc中文'
>>> c.encode(encoding="UTF-8")
b'\xe4\xb8\xad\xe6\x96\x87abc\xe4\xb8\xad\xe6\x96\x87'
Notice how the "abc" in the UTF-8 string is encoded as single bytes? They are still the same bytes as their ASCII counterparts!
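If you want to check that programmatically, a quick follow-up sketch (Python 3 again; the 6:9 slice is just where 'abc' lands, since each of the two leading Chinese characters takes three bytes):
encoded = '中文abc中文'.encode('utf-8')
# the two leading characters occupy bytes 0-5, so 'abc' sits at bytes 6-8
print(encoded[6:9])                              # b'abc'
print(encoded[6:9] == 'abc'.encode('ascii'))     # True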
Upvotes: 2