Reputation: 16065
From PEP 393 I understand that Python can use multiple encodings internally when storing strings: latin-1, UCS-2 and UCS-4. Is it possible to find out which encoding is used to store a particular string, e.g. in the interactive interpreter?
Upvotes: 4
Views: 465
Reputation: 152647
There is a CPython C API function for the kind of the unicode object: PyUnicode_KIND.
If you have Cython and IPython1, you can easily access that function:
In [1]: %load_ext cython

In [2]: %%cython
   ...:
   ...: cdef extern from "Python.h":
   ...:     int PyUnicode_KIND(object o)
   ...:
   ...: cpdef unicode_kind(astring):
   ...:     if type(astring) is not str:
   ...:         raise TypeError('astring must be a string')
   ...:     return PyUnicode_KIND(astring)

In [3]: a = 'a'
   ...: b = 'Ǧ'
   ...: c = '😀'

In [4]: unicode_kind(a), unicode_kind(b), unicode_kind(c)
Out[4]: (1, 2, 4)
Here 1 represents latin-1, while 2 and 4 represent UCS-2 and UCS-4, respectively.
You could then use a dictionary to map these numbers into a string that represents the encoding.
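Such a mapping could be sketched like this (KIND_TO_ENCODING is just an illustrative name, not a CPython constant):

```python
# Map the integers returned by PyUnicode_KIND to encoding names.
# The dictionary name is illustrative, not part of any API.
KIND_TO_ENCODING = {1: 'latin-1', 2: 'UCS-2', 4: 'UCS-4'}

# Applied to the kinds from Out[4] above:
print([KIND_TO_ENCODING[k] for k in (1, 2, 4)])
# ['latin-1', 'UCS-2', 'UCS-4']
```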
1 It's also possible without Cython and/or IPython; the combination is just very handy. Otherwise it would require more code (without IPython) and/or a manual installation step (without Cython).
Upvotes: 2
Reputation: 155363
The only way you can test this from the Python layer (without resorting to manually mucking about with object internals via ctypes or Python extension modules) is by checking the ordinal value of the largest character in the string, which determines whether the string is stored as ASCII/latin-1, UCS-2 or UCS-4. A solution would be something like:
def get_bpc(s):
    maxordinal = ord(max(s, default='\0'))
    if maxordinal < 256:
        return 1
    elif maxordinal < 65536:
        return 2
    else:
        return 4
You can't actually rely on sys.getsizeof, because for non-ASCII strings (even one-byte-per-character strings that fit in the latin-1 range) the string might or might not have populated its cached UTF-8 representation. Tricks like adding an extra character and comparing sizes could then actually show the size decrease. Worse, the caching can happen "at a distance", so you're not necessarily directly responsible for the existence of the cached UTF-8 form on the string you're checking. For example:
>>> e = 'é'
>>> sys.getsizeof(e)
74
>>> sys.getsizeof(e + 'a')
75
>>> class é: pass # One of several ways to trigger creation/caching of UTF-8 form
>>> sys.getsizeof(e)
77 # !!! Grew three bytes even though it's the same variable
>>> sys.getsizeof(e + 'a')
75 # !!! Adding a character shrunk the string!
Upvotes: 0
Reputation: 18687
One way of finding out which exact internal encoding CPython uses for a specific unicode string is to peek in the actual (CPython) object.
According to PEP 393 (Specification section), all unicode string objects start with PyASCIIObject
:
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    struct {
        unsigned int interned:2;
        unsigned int kind:2;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
    } state;
    wchar_t *wstr;
} PyASCIIObject;
Character size is stored in the kind bit-field, as described in the PEP, as well as in the code comments in unicodeobject:
00 => str is not initialized (data are in wstr)
01 => 1 byte (Latin-1)
10 => 2 byte (UCS-2)
11 => 4 byte (UCS-4);
After we get the address of the string with id(string), we can use the ctypes module to read the object's bytes (and the kind field):
import ctypes

mystr = "x"
# Reads the very first byte of the object header (part of ob_refcnt),
# just to demonstrate peeking at the raw object memory.
first_byte = ctypes.c_uint8.from_address(id(mystr)).value
The offset from the object's start to kind is PyObject_HEAD + Py_ssize_t length + Py_hash_t hash, which in turn is Py_ssize_t ob_refcnt + a pointer to ob_type + Py_ssize_t length + the size of another pointer for the hash type:

offset = 2 * ctypes.sizeof(ctypes.c_ssize_t) + 2 * ctypes.sizeof(ctypes.c_void_p)

(which is 32 on x64)
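That offset can be cross-checked with a ctypes.Structure that mirrors the leading fields of PyASCIIObject (a sketch; it assumes Py_hash_t has the same size as Py_ssize_t, which holds on CPython):

```python
import ctypes

# Sketch of PyASCIIObject's leading fields; names mirror the struct above.
class PyASCIIObjectHead(ctypes.Structure):
    _fields_ = [
        ('ob_refcnt', ctypes.c_ssize_t),  # PyObject_HEAD is a refcount...
        ('ob_type', ctypes.c_void_p),     # ...plus a pointer to the type
        ('length', ctypes.c_ssize_t),
        ('hash', ctypes.c_ssize_t),       # Py_hash_t is Py_ssize_t-sized
        ('state', ctypes.c_uint32),       # the bit-field fits in one int
    ]

print(PyASCIIObjectHead.state.offset)  # 32 on a 64-bit build
```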
All put together:
import ctypes

def bytes_per_char(s):
    offset = 2 * ctypes.sizeof(ctypes.c_ssize_t) + 2 * ctypes.sizeof(ctypes.c_void_p)
    kind = ctypes.c_uint8.from_address(id(s) + offset).value >> 2 & 3
    size = {0: ctypes.sizeof(ctypes.c_wchar), 1: 1, 2: 2, 3: 4}
    return size[kind]
Gives:
>>> bytes_per_char('test')
1
>>> bytes_per_char('đžš')
2
>>> bytes_per_char('😀')
4
Note we had to handle the special case of kind == 0, because then the character type is exactly wchar_t (which is 16 or 32 bits wide, depending on the platform).
Upvotes: 1