Andrew

Reputation: 509

Best way to get length of numpy unicode string dtype

I am trying to determine the maximum element length of a numpy unicode array. For example, if I have:

import numpy as np

# (dtypes added for clarity)
a = np.array(['a'], dtype='U5')
print(get_dtype_length(a))

I'd like it to print 5.

I can do something like:

def get_dtype_length(a):
  dtype = a.dtype
  dtype_string = dtype.descr[0][1]  # == '<U5'
  length = int(dtype_string[2:])
  return length

But that seems like a roundabout way of inferring something that must be available somewhere. Is there an attribute or numpy function that I haven't found to do this directly?

Clarification based on comments:

I am specifically looking for the maximum allowable length of any element in the array, not the length of any specific element (e.g., not len(a[0]) == 1). The motivation is that if I update a with something like a[0] = 'string_longer_than_dtype_of_a', I don't want the element to be silently truncated to strin.
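A minimal sketch of the truncation I want to avoid, assuming the U5 array from the example above:

```python
import numpy as np

a = np.array(['a'], dtype='U5')

# Assigning a longer string does not raise; NumPy keeps only the
# first 5 characters because the dtype is fixed at U5.
a[0] = 'string_longer_than_dtype_of_a'
print(a[0])     # strin
print(a.dtype.kind, a.dtype.itemsize)  # kind and byte size are unchanged
```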

In numpy version 1.19 I believe np.can_cast(newVal.dtype, a.dtype, casting='safe') would be a valid test for my use case (as in 1.19 safe will also test if casting results in truncation), but it still doesn't actually solve the question of testing character size.

Upvotes: 2

Views: 1737

Answers (1)

Mad Physicist

Reputation: 114578

The 5 in U5 is the maximum number of characters each element can hold, not the size of a character:

The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters.

From the docs.
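As a quick illustration of that distinction (a sketch, with the byte-string dtype S5 added by me purely for comparison), the number in the type string matches itemsize for a byte string but not for a Unicode string:

```python
import numpy as np

u5 = np.dtype('U5')  # Unicode: the 5 counts characters
s5 = np.dtype('S5')  # byte string: the 5 counts bytes

print(u5.kind, u5.itemsize)  # U 20
print(s5.kind, s5.itemsize)  # S 5
```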

The size of a single Unicode character can be a constant in your program:

 sizeof_numpy_unicode_char = np.dtype('U1').itemsize

You can then divide the total number of bytes per element by this constant to get the number of characters each element can hold, using either dtype.itemsize or the shortcut ndarray.itemsize:

def get_length(a):
    return a.itemsize // sizeof_numpy_unicode_char

The size of a character is indeed fixed: NumPy stores each character of a Unicode dtype as 4 bytes (UCS-4).
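Putting the pieces together, here is a minimal self-contained sketch. The dtype kind check and the set_checked helper are my own additions for illustration (hypothetical names, not part of NumPy); they show one way to refuse an assignment that would otherwise truncate.

```python
import numpy as np

# Bytes used by a single character in a NumPy unicode ('U') dtype.
SIZEOF_NUMPY_UNICODE_CHAR = np.dtype('U1').itemsize


def get_length(a):
    """Return the maximum number of characters per element of a 'U' array."""
    if a.dtype.kind != 'U':
        raise TypeError(f"expected a unicode array, got dtype {a.dtype}")
    return a.dtype.itemsize // SIZEOF_NUMPY_UNICODE_CHAR


def set_checked(a, index, value):
    """Hypothetical helper: assign value, refusing anything that would truncate."""
    limit = get_length(a)
    if len(value) > limit:
        raise ValueError(f"{value!r} has {len(value)} characters, limit is {limit}")
    a[index] = value


a = np.array(['a'], dtype='U5')
print(get_length(a))  # 5

set_checked(a, 0, 'hello')  # fits exactly
try:
    set_checked(a, 0, 'string_longer_than_dtype_of_a')
except ValueError as exc:
    print(exc)  # rejected instead of being truncated
```

Checking len(value) against get_length(a) before assigning keeps the test independent of how a particular NumPy version treats string casts.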

Upvotes: 3
