Cao Vison

Reputation: 57

Why does encoding in UTF-8 still result in ASCII?

As per this code:

# coding=utf-8
import sys
import chardet

print(sys.getdefaultencoding())

a = 'abc'

print(type(a))
print(chardet.detect(a))

b = a.decode('ascii')

print(type(b))


c = '中文'

print(type(c))
print(chardet.detect(c))


m = b.encode('utf-8')
print(type(m))
print(chardet.detect(m))

n = u'abc'

print(type(n))

x = n.encode(encoding='utf-8')

print(type(x))
print(chardet.detect(x))

I use UTF-8 to encode n, but the detected encoding of the result is still ASCII.

So I want to know: what is the relationship between UTF-8, ASCII, and Unicode?

I am running this with Python 2.


Upvotes: 0

Views: 2769

Answers (3)

Mark Ransom

Reputation: 308530

It's because the designers of both Unicode and UTF-8 were brilliant and managed to achieve an impressive feat of backwards compatibility.

It started with the Latin-1 character set, which defined 256 characters, the first 128 of which were taken directly from ASCII. Each of these characters fits into a single byte.

Unicode built an expanded character set, and it started by stating that the first 256 codepoints would be the characters from Latin-1. This meant that the first 128 codepoints retained the same numeric value they had in ASCII.
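A minimal Python 3 sketch of that numbering claim, using only the built-in "latin-1" and "ascii" codecs. The variable names are illustrative and not part of the original answer.

```python
# Every Latin-1 byte value decodes to the Unicode code point with the same number.
latin1_matches = all(
    ord(bytes([value]).decode("latin-1")) == value for value in range(256)
)

# The first 128 byte values decode to the same character under ASCII and Latin-1.
ascii_matches = all(
    bytes([value]).decode("ascii") == bytes([value]).decode("latin-1")
    for value in range(128)
)

print(latin1_matches, ascii_matches)
```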

Then came UTF-8, which uses a variable number of bytes per character. Characters that take more than a single byte are marked by having the upper bit of each byte set, so any byte with its upper bit clear is a complete single-byte character. Since ASCII bytes also have the upper bit clear, the encoding of those characters is identical in ASCII and UTF-8!
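The upper-bit rule can be checked directly. This is a small Python 3 sketch; high_bits is a hypothetical helper written for this illustration, not a library function.

```python
def high_bits(data):
    """Return, for each byte, whether its upper (0x80) bit is set."""
    return [bool(byte & 0x80) for byte in data]

ascii_text = "abc"
mixed_text = "中a"

# Pure ASCII text produces the same bytes under both codecs.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# The three bytes of '中' all have the upper bit set; the single byte of 'a' does not.
mixed_flags = high_bits(mixed_text.encode("utf-8"))

print(same_bytes)   # True
print(mixed_flags)  # [True, True, True, False]
```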

Upvotes: 0

Charles Langlois

Reputation: 4298

UTF-8 is designed so that characters 0-127 (Unicode code points U+0000 to U+007F) are the corresponding ASCII characters and are encoded the same way. chardet.detect thus naturally reports a string containing only those characters as ASCII-encoded, since in effect it is...
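To see why a detector cannot tell the two apart, here is a rough Python 3 sketch that avoids chardet and relies only on the built-in codecs. decodes_as is a hypothetical helper introduced for this example.

```python
def decodes_as(data, encoding):
    """Return True if the bytes are valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

plain = "abc".encode("utf-8")
chinese = "中文".encode("utf-8")

# Bytes for pure ASCII text are valid under both codecs and decode to the same text.
print(decodes_as(plain, "ascii"), decodes_as(plain, "utf-8"))      # True True
print(plain.decode("ascii") == plain.decode("utf-8"))              # True

# Bytes for non-ASCII text are valid UTF-8 but not valid ASCII.
print(decodes_as(chinese, "ascii"), decodes_as(chinese, "utf-8"))  # False True
```

With nothing in the bytes to distinguish the two encodings, reporting the narrower one (ASCII) is a reasonable answer for a detector to give.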

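The u'...' notation in Python 3 exists only for backward compatibility and is the same as normal string notation, so there u'abc' is the same as 'abc'. In Python 2, which the question uses, u'abc' is a unicode object while 'abc' is a byte string (str).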
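Under Python 3, the equivalence of the two notations can be confirmed with a short check. This is a sketch with illustrative variable names.

```python
plain = 'abc'
prefixed = u'abc'

# In Python 3 both literals produce the same str type and compare equal.
same_type = type(plain) is type(prefixed)
same_value = plain == prefixed

# Encoding either one yields bytes; for ASCII-only text they are the ASCII bytes.
encoded = prefixed.encode('utf-8')

print(same_type, same_value)  # True True
print(encoded)                # b'abc'
```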

Upvotes: 0

Michael Guffre

Reputation: 379

UTF-8 is a variable-width encoding, designed so that ASCII characters map directly to the same single bytes in UTF-8.

Since your UTF-8 string contains only ASCII characters, the string is, honestly, both an ASCII and a UTF-8 string.

This visual might help:

>>> c = '中文abc中文'
>>>
>>>
>>> c
'中文abc中文'
>>> c.encode(encoding="UTF-8")
b'\xe4\xb8\xad\xe6\x96\x87abc\xe4\xb8\xad\xe6\x96\x87'

Notice how the characters of "abc" in the UTF-8 string are each only a single byte? They are still the same bytes as their ASCII counterparts!
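The per-character widths can also be listed explicitly. This Python 3 sketch reuses the string from the session above; the variable names are illustrative.

```python
text = '中文abc中文'

# Number of UTF-8 bytes each character occupies.
widths = [(ch, len(ch.encode('utf-8'))) for ch in text]
print(widths)

# The middle three bytes of the encoded string are exactly the ASCII bytes for "abc".
encoded = text.encode('utf-8')
print(encoded[6:9])  # b'abc'
```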

Upvotes: 2
