numpy trimming trailing zeros in byte strings

Question

I am trying to serialise spacy.io documents into byte strings and save them in a numpy array.

spacy has a to_bytes function which produces a bytearray. I call str on this bytearray and insert that string object into a numpy array. This works for most documents except those that end with a trailing zero byte.

To reproduce:

>>> import numpy as np
>>> b_arr = bytearray(b'\xca\x00\x00\x00n\xff\xff\xff\x19C\x98\xc9\x06\xb18{\xa5\xe0\xaf6\xe3\x9f\xa7\xad\x86\xd6\x8d\xc0\xe6Mo;{\x96xm\x80\xe5\x8c\x9f...')
>>> b_arr_text = str(b_arr)
>>> b_arr_np = np.asarray([b_arr_text], dtype=np.str)
>>> b_arr_text == b_arr_np[0]
Out[229]: False
>>> len(b_arr_text)
Out[230]: 206
>>> len(b_arr_np[0])
Out[231]: 205
>>> b_arr_np.dtype
Out[232]: dtype('S206')

The numpy string drops the trailing zero byte, even though the dtype of the fixed-length string (S206) has the same length as the input text.

You can see this even from creating any bytestring with trailing zero bytes in an array:

>>> np.asarray(['\xca\x00\x00\x00'], dtype=np.str)

Out: array(['\xca'], dtype='|S4')

I presume numpy deems trailing zeros to be insignificant? However, without them I can't deserialize these bytestrings back to a spacy document object.

Is there any way to get numpy not to trim the trailing zeros or do I have to stick to Python lists for this scenario?

Vadim Shkaberda · Accepted Answer

This is normal behavior. The trailing zeros are only stripped when an element is read back; they are still stored in the array, as you can see by calling b_arr_np.tostring():

b_arr = bytearray(b'\xca\x00\x00\x00')

b_arr_text = str(b_arr)

b_arr_np = np.asarray([b_arr_text], dtype=np.str)

b_arr_np
Out[303]: 
array(['\xca'], 
      dtype='|S4')

b_arr_np.tostring()
Out[304]: '\xca\x00\x00\x00'
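The transcript above is Python 2. A minimal sketch of the same check on Python 3 follows, assuming a recent NumPy in which np.str is no longer available and tobytes() is the current name for tostring(). The variable names are illustrative only.

```python
import numpy as np

raw = b"\xca\x00\x00\x00"  # one payload byte followed by three zero bytes

# Fixed-width bytes dtype: every item occupies exactly 4 bytes.
arr = np.array([raw], dtype="S4")

element = arr[0]        # element access strips the trailing null bytes
buffer = arr.tobytes()  # the underlying buffer is returned unmodified

print(arr.dtype.itemsize)  # width of each item in bytes
print(len(element))        # length after the trailing nulls are stripped
print(buffer == raw)       # whether the raw buffer still matches the input
```

So the data is not lost inside the array, but any code path that reads individual elements will see the shortened value.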

See the GitHub issue "information loss with bytes type" for background. The workarounds are either to make sure the strings end in a non-zero byte, or to use dtype=np.uint8 with b_arr:

b_arr_np = np.asarray([b_arr], dtype=np.uint8)

b_arr_np
Out[319]: array([[202,   0,   0,   0]], dtype=uint8)

b_arr_np.tostring()

Out[320]: '\xca\x00\x00\x00'
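As a rough sketch of how the uint8 approach could be used for a round trip on Python 3, the helpers below (pack and unpack are hypothetical names, not part of NumPy or spaCy) convert a payload to a uint8 array and back. Since serialized documents usually differ in length, the sketch also shows an object-dtype array that holds the bytes objects themselves; that is an alternative I am assuming fits the use case, not something the answer above proposes.

```python
import numpy as np


def pack(payload):
    """Copy a bytes-like payload into a 1-D uint8 array, one element per byte."""
    return np.frombuffer(bytes(payload), dtype=np.uint8).copy()


def unpack(arr):
    """Return the exact bytes held by a uint8 array, trailing zeros included."""
    return arr.tobytes()


# Stand-ins for serialized documents; both end in zero bytes.
docs = [bytearray(b"\xca\x00\x00\x00"), bytearray(b"\x01\x02\x00")]

# Round trip of a single payload through uint8.
single = unpack(pack(docs[0]))

# Alternative: an object array stores references to Python bytes objects,
# so NumPy never interprets (or strips) their contents.
store = np.empty(len(docs), dtype=object)
for i, doc in enumerate(docs):
    store[i] = bytes(doc)

restored = list(store)
print(single == bytes(docs[0]))
print(restored == [bytes(d) for d in docs])
```

The uint8 form keeps the data in a numeric array, which suits payloads of equal length; the object array behaves much like a Python list and gives up NumPy's compact storage in exchange for preserving each payload exactly.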
