benjimin

Reputation: 4890

How to use numpy arrays of strings in python 3?

Numpy's type strings (which specify the endianness if applicable, the kind of data, and the size of each item) include a "String" option 'S'. For example, '|S20' or 'S20' represents a fixed-length data type of 20 chars (in the C sense, i.e., 20 bytes).

Is this String ('S') type deprecated?

In python 2 it made sense to use this datatype for arrays of fixed-length python strings. In python 3, this numpy type now corresponds to python bytes objects, and an explicit encoding is necessary to translate this to python strings.

Is there any preferred way of storing python 3 strings in numpy arrays? How does the data type length now relate to the number of characters in the string? Does the Unicode-string type 'U' store a fixed number of characters, or does it vary depending on which characters are stored (i.e. on whether they have short encodings)? Is there a preferred way to convert numpy strings to python strings?

Upvotes: 1

Views: 5550

Answers (1)

Akaisteph7

Reputation: 6486

In Python 3, numpy uses the Unicode type 'U' (scalar type numpy.str_) for strings by default. You can treat the elements as ordinary str, since numpy.str_ is a subclass of str. The preferred way to store strings is this Unicode type. Its length counts characters (Unicode code points), not bytes: every character takes 4 bytes regardless of how short its encoding would be, and when you build an array from a list the length is set by the longest string. This length is fixed, so assigning a longer string to an element of an existing array truncates it to that size. The 'S' type still exists, but it holds bytes, which have to be decoded explicitly. To get plain Python strings back, call tolist() on the array.

import numpy as np

arr = np.asarray(['abc', 'xyz'])
print(type(arr[0]))           # <class 'numpy.str_'>
print(type(arr.tolist()[0]))  # <class 'str'>
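To make the fixed length and the bytes/str distinction concrete, here is a minimal sketch. The array names `words`, `raw`, and `text` are just illustrative, and UTF-8 is an assumed encoding for the 'S' data; use whatever encoding your bytes were actually written with.

```python
import numpy as np

# A 'U' array is fixed width; the width is the longest string, in code points.
words = np.array(["abc", "na\u00efve"])   # "naïve" has 5 code points
print(words.dtype.kind)    # U
print(words.itemsize)      # 20  (5 code points * 4 bytes each)

# Assigning a longer string silently truncates it to the array's width.
words[0] = "abcdefgh"
print(words[0])            # abcde

# An 'S' array holds bytes; decode explicitly to get a Unicode array.
raw = np.array([b"abc", b"xyz"], dtype="S20")
text = np.char.decode(raw, "utf-8")       # encoding is an assumption here
print(text.tolist())       # ['abc', 'xyz']
```

If truncation is a concern, one option is to create the array with an explicit wider dtype such as 'U50' before assigning into it.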

Upvotes: 3
